2010-12-05 17:29:37

by Minchan Kim

Subject: [PATCH v4 0/7] f/madvise(DONTNEED) support

Recently there was a report about working set page eviction caused by an rsync
workload. Application programmers want to use fadvise, but it's not easy to use.
You can find a detailed description in [2/7].
- [1/7] removes checkpatch's reports in mm/swap.c
- [2/7] moves an invalidated page which is dirty/writeback on the active list
to the inactive list's head.
- [3/7] moves a memcg reclaimable page to the inactive list's tail.
- [4/7] moves an invalidated page to the inactive list's tail when the
page's writeout completes, so it is reclaimed asap.
- [5/7] adds profiling information for evaluation.
- [6/7] removes the zap_details NULL dependency from some functions, in
preparation for the next patch.
- [7/7] avoids calling mark_page_accessed in the madvise(DONTNEED) case.

These patches are based on mmotm-12-02.
Before applying the series, please remove the patches below.
mm-deactivate-invalidated-pages.patch
mm-deactivate-invalidated-pages-fix.patch

Minchan Kim (7):
Fix checkpatch's report in swap.c
deactivate invalidated pages
move memcg reclaimable page into tail of inactive list
Reclaim invalidated page ASAP
add profile information for invalidated page reclaim
Remove zap_details NULL dependency
Prevent activation of page in madvise_dontneed

include/linux/memcontrol.h | 6 ++
include/linux/mm.h | 10 ++++
include/linux/swap.h | 1 +
include/linux/vmstat.h | 4 +-
mm/madvise.c | 13 +++--
mm/memcontrol.c | 27 +++++++++
mm/memory.c | 19 ++++---
mm/mmap.c | 6 ++-
mm/page-writeback.c | 12 ++++-
mm/swap.c | 127 +++++++++++++++++++++++++++++++++++++++++---
mm/truncate.c | 17 +++++--
mm/vmstat.c | 3 +
12 files changed, 216 insertions(+), 29 deletions(-)


2010-12-05 17:29:53

by Minchan Kim

Subject: [PATCH v4 1/7] Fix checkpatch's report in swap.c

checkpatch reports the following problems.
They are very annoying. This patch fixes them.

barrios@barrios-desktop:~/linux-2.6$ ./scripts/checkpatch.pl -f mm/swap.c
WARNING: line over 80 characters
+ if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(mark_page_accessed);

ERROR: code indent should use tabs where possible
+ ^I^I}$

WARNING: please, no space before tabs
+ ^I^I}$

WARNING: please, no spaces at the start of a line
+ ^I^I}$

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(__pagevec_release);

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(____pagevec_lru_add);

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(pagevec_lookup);

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(pagevec_lookup_tag);

total: 1 errors, 8 warnings, 517 lines checked

Signed-off-by: Minchan Kim <[email protected]>
---
mm/swap.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 3f48542..d5822b0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -118,7 +118,8 @@ static void pagevec_move_tail(struct pagevec *pvec)
zone = pagezone;
spin_lock(&zone->lru_lock);
}
- if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && !PageActive(page) &&
+ !PageUnevictable(page)) {
int lru = page_lru_base_type(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
pgmoved++;
@@ -212,7 +213,6 @@ void mark_page_accessed(struct page *page)
SetPageReferenced(page);
}
}
-
EXPORT_SYMBOL(mark_page_accessed);

void __lru_cache_add(struct page *page, enum lru_list lru)
@@ -371,7 +371,7 @@ void release_pages(struct page **pages, int nr, int cold)
}
__pagevec_free(&pages_to_free);
pagevec_reinit(&pages_to_free);
- }
+ }
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
@@ -396,7 +396,6 @@ void __pagevec_release(struct pagevec *pvec)
release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
pagevec_reinit(pvec);
}
-
EXPORT_SYMBOL(__pagevec_release);

/*
@@ -438,7 +437,6 @@ void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
release_pages(pvec->pages, pvec->nr, pvec->cold);
pagevec_reinit(pvec);
}
-
EXPORT_SYMBOL(____pagevec_lru_add);

/*
@@ -481,7 +479,6 @@ unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages);
return pagevec_count(pvec);
}
-
EXPORT_SYMBOL(pagevec_lookup);

unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
@@ -491,7 +488,6 @@ unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
nr_pages, pvec->pages);
return pagevec_count(pvec);
}
-
EXPORT_SYMBOL(pagevec_lookup_tag);

/*
--
1.7.0.4

2010-12-05 17:30:06

by Minchan Kim

Subject: [PATCH v4 2/7] deactivate invalidated pages

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It is caused by backup workloads (e.g., nightly rsync).
Such a workload creates use-once pages but touches them twice,
which promotes them to the active list and thus results in
working set page eviction.

Some application developers want POSIX_FADV_NOREUSE to be
supported, but other OSes don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As another approach, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel encounters a page that is being
written during invalidate_mapping_pages, it can't invalidate it.
That makes it very hard for application programmers to use, because
they always have to sync the data before calling
fadvise(..., POSIX_FADV_DONTNEED) to make sure the pages are
discardable. In the end, they can't use the kernel's deferred write,
so they may see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
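
For illustration, the workaround applications are currently forced into
looks roughly like the sketch below (a minimal sketch only; the helper
name is made up and this is not part of the patch):

#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: data must be synced first, otherwise the
 * dirty/writeback pages survive the fadvise and nothing is dropped. */
static void drop_cached_range(int fd, off_t offset, off_t len)
{
	fsync(fd);	/* gives up the kernel's deferred write */
	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}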

In fact, invalidation is a very big hint to the reclaimer.
It means we don't use the page any more. So let's move a page
that is still being written to the inactive list's head.

Why the head? A dirty/writeback page will be flushed sooner or
later anyway, and keeping it at the head avoids pageout's
single-page writeout, which is less effective than the flusher's
writeout.

Originally, I reused Peter's lru_demote with some changes, so I
added his Signed-off-by.

Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>

Andrew, before applying this series, please drop the two patches below.
mm-deactivate-invalidated-pages.patch
mm-deactivate-invalidated-pages-fix.patch

Changelog since v3:
- Change function comments - suggested by Johannes
- Change function name - suggested by Johannes
- add only dirty/writeback pages to the deactivate pagevec

Changelog since v2:
- mapped page leaves alone - suggested by Mel
- pass part related PG_reclaim in next patch.

Changelog since v1:
- modify description
- correct typo
- add some comment

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
mm/swap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 17 ++++++++--
3 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..605ab62 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index d5822b0..1f36f6f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,8 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+

/*
* This path almost never happens for VM activity - pages are normally
@@ -267,6 +269,59 @@ void add_page_to_unevictable_list(struct page *page)
}

/*
+ * If the page can not be invalidated, it is moved to the
+ * inactive list to speed up its reclaim. It is moved to the
+ * head of the list, rather than the tail, to give the flusher
+ * threads some time to write it out, as this is much more
+ * effective than the single-page writeout from reclaim.
+ */
+static void lru_deactivate(struct page *page, struct zone *zone)
+{
+ int lru, file;
+
+ if (!PageLRU(page) || !PageActive(page))
+ return;
+
+ /* Some processes are using the page */
+ if (page_mapped(page))
+ return;
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+}
+
+static void ____pagevec_lru_deactivate(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ lru_deactivate(page, zone);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
* disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -292,6 +347,29 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ ____pagevec_lru_deactivate(pvec);
+}
+
+/**
+ * deactivate_page - forcefully deactivate a page
+ * @page: page to deactivate
+ *
+ * This function hints the VM that @page is a good reclaim candidate,
+ * for example if its invalidation fails due to the page being dirty
+ * or under writeback.
+ */
+void deactivate_page(struct page *page)
+{
+ if (likely(get_page_unless_zero(page))) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ if (!pagevec_add(pvec, page))
+ ____pagevec_lru_deactivate(pvec);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
}

void lru_add_drain(void)
diff --git a/mm/truncate.c b/mm/truncate.c
index cd94607..ef03cbc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
{
struct pagevec pvec;
pgoff_t next = start;
- unsigned long ret = 0;
+ unsigned long ret;
+ unsigned long count = 0;
int i;

pagevec_init(&pvec, 0);
@@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (lock_failed)
continue;

- ret += invalidate_inode_page(page);
-
+ ret = invalidate_inode_page(page);
+ /*
+ * If the page is dirty or under writeback, we can not
+ * invalidate it now. But we assume that attempted
+ * invalidation is a hint that the page is no longer
+ * of interest and try to speed up its reclaim.
+ */
+ if (!ret && (PageDirty(page) || PageWriteback(page)))
+ deactivate_page(page);
+ count += ret;
unlock_page(page);
if (next > end)
break;
@@ -369,7 +378,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
mem_cgroup_uncharge_end();
cond_resched();
}
- return ret;
+ return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);

--
1.7.0.4

2010-12-05 17:30:16

by Minchan Kim

Subject: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

Golbal page reclaim moves reclaimalbe pages into inactive list
to reclaim asap. This patch apply the rule in memcg.
It can help to prevent unnecessary working page eviction of memcg.

Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 27 +++++++++++++++++++++++++++
mm/swap.c | 3 ++-
3 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 067115c..8317f5c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
@@ -207,6 +208,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
return ;
}

+static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ return ;
+}
+
static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
{
return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 729beb7..f9435be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -829,6 +829,33 @@ void mem_cgroup_del_lru(struct page *page)
mem_cgroup_del_lru_list(page, page_lru(page));
}

+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim. If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
+ enum lru_list lru = page_lru_base_type(page);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Used bit is set without atomic ops but after smp_wmb().
+ * For making pc->mem_cgroup visible, insert smp_rmb() here.
+ */
+ smp_rmb();
+ /* unused or root page is not rotated. */
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 1f36f6f..0fe98e7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -122,8 +122,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
}
if (PageLRU(page) && !PageActive(page) &&
!PageUnevictable(page)) {
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
pgmoved++;
}
}
--
1.7.0.4

2010-12-05 17:30:26

by Minchan Kim

Subject: [PATCH v4 5/7] add profile information for invalidated page reclaim

This patch adds profiling information about invalidated page reclaim.
It's only for profiling during testing, so it can be discarded when the
series is merged.
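
For example, the new events can be read from /proc/vmstat; the helper
below is only an illustrative sketch (not part of this patch):

#include <stdio.h>
#include <string.h>

/* Print the pginvalidate/pgreclaim counters added by this patch. */
static void show_invalidate_stats(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char line[128];

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pginvalidate", 12) ||
		    !strncmp(line, "pgreclaim", 9))
			fputs(line, stdout);
	fclose(f);
}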

Signed-off-by: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 4 ++--
mm/swap.c | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 833e676..c38ad95 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -30,8 +30,8 @@

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
- PGFREE, PGACTIVATE, PGDEACTIVATE,
- PGFAULT, PGMAJFAULT,
+ PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
+ PGRECLAIM, PGFAULT, PGMAJFAULT,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
diff --git a/mm/swap.c b/mm/swap.c
index 0f23998..2f21e6e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -317,6 +317,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
* is _really_ small and it's non-critical problem.
*/
SetPageReclaim(page);
+ __count_vm_event(PGRECLAIM);
} else {
/*
* The page's writeback ends up during pagevec
@@ -328,6 +329,8 @@ static void lru_deactivate(struct page *page, struct zone *zone)

if (active)
__count_vm_event(PGDEACTIVATE);
+
+ __count_vm_event(PGINVALIDATE);
update_page_reclaim_stat(zone, page, file, 0);
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3555636..ef6102d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,9 @@ static const char * const vmstat_text[] = {
"pgactivate",
"pgdeactivate",

+ "pginvalidate",
+ "pgreclaim",
+
"pgfault",
"pgmajfault",

--
1.7.0.4

2010-12-05 17:30:34

by Minchan Kim

Subject: [PATCH v4 7/7] Prevent activation of page in madvise_dontneed

Currently zap_pte_range always activates pages which are pte_young &&
!VM_SequentialReadHint(vma). But in the MADV_DONTNEED case this is
unnecessary, since the page won't be used any more.
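
For reference, the user-space call whose teardown path this patch
optimizes looks roughly like this (an illustrative sketch only; the
helper name is made up and this is not part of the patch):

#include <sys/mman.h>

/* The pages zapped on behalf of this call no longer need to be
 * marked accessed by zap_pte_range. */
static void discard_buffer(void *buf, size_t len)
{
	madvise(buf, len, MADV_DONTNEED);
}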

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Hugh Dickins <[email protected]>

Changelog since v3:
- Change variable name - suggested by Johannes
- Union ignore_references with zap_details - suggested by Hugh

Changelog since v2:
- remove unnecessary description

Changelog since v1:
- change word from promote to activate
- add activate argument to zap_pte_range and family function
---
include/linux/mm.h | 4 +++-
mm/madvise.c | 6 +++---
mm/memory.c | 5 ++++-
3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6522ae4..e57190f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -771,12 +771,14 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
+ bool ignore_references; /* For page activation */
};

#define __ZAP_DETAILS_INITIALIZER(name) \
{ .nonlinear_vma = NULL \
, .check_mapping = NULL \
- , .i_mmap_lock = NULL }
+ , .i_mmap_lock = NULL \
+ , .ignore_references = false }

#define DEFINE_ZAP_DETAILS(name) \
struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)
diff --git a/mm/madvise.c b/mm/madvise.c
index bfa17aa..8e7aba3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -163,6 +163,7 @@ static long madvise_dontneed(struct vm_area_struct * vma,
unsigned long start, unsigned long end)
{
DEFINE_ZAP_DETAILS(details);
+ details.ignore_references = true;

*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -173,10 +174,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
details.last_index = ULONG_MAX;

zap_page_range(vma, start, end - start, &details);
- } else {
-
+ } else
zap_page_range(vma, start, end - start, &details);
- }
+
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index c0879bb..44d87e1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -897,6 +897,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *pte;
spinlock_t *ptl;
int rss[NR_MM_COUNTERS];
+ bool ignore_references = details->ignore_references;

init_rss_vec(rss);

@@ -952,7 +953,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
- likely(!VM_SequentialReadHint(vma)))
+ likely(!VM_SequentialReadHint(vma)) &&
+ !ignore_references)
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
@@ -1218,6 +1220,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size)
{
DEFINE_ZAP_DETAILS(details);
+ details.ignore_references = true;
if (address < vma->vm_start || address + size > vma->vm_end ||
!(vma->vm_flags & VM_PFNMAP))
return -1;
--
1.7.0.4

2010-12-05 17:30:24

by Minchan Kim

Subject: [PATCH v4 6/7] Remove zap_details NULL dependency

Some functions that use zap_details depend on the assumption that
the zap_details parameter is NULL when certain fields are 0.

This patch removes that dependency to make the next patch easier to
review/merge. It should not change behavior.

Signed-off-by: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 8 ++++++++
mm/madvise.c | 15 +++++++++------
mm/memory.c | 14 ++++++++------
mm/mmap.c | 6 ++++--
4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e097df6..6522ae4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -773,6 +773,14 @@ struct zap_details {
unsigned long truncate_count; /* Compare vm_truncate_count */
};

+#define __ZAP_DETAILS_INITIALIZER(name) \
+ { .nonlinear_vma = NULL \
+ , .check_mapping = NULL \
+ , .i_mmap_lock = NULL }
+
+#define DEFINE_ZAP_DETAILS(name) \
+ struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);

diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..bfa17aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -162,18 +162,21 @@ static long madvise_dontneed(struct vm_area_struct * vma,
struct vm_area_struct ** prev,
unsigned long start, unsigned long end)
{
+ DEFINE_ZAP_DETAILS(details);
+
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;

if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
- struct zap_details details = {
- .nonlinear_vma = vma,
- .last_index = ULONG_MAX,
- };
+ details.nonlinear_vma = vma;
+ details.last_index = ULONG_MAX;
+
zap_page_range(vma, start, end - start, &details);
- } else
- zap_page_range(vma, start, end - start, NULL);
+ } else {
+
+ zap_page_range(vma, start, end - start, &details);
+ }
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index ebfeedf..c0879bb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -900,6 +900,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

init_rss_vec(rss);

+ if (!details->check_mapping && !details->nonlinear_vma)
+ details = NULL;
+
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
do {
@@ -1038,9 +1041,6 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;

- if (details && !details->check_mapping && !details->nonlinear_vma)
- details = NULL;
-
BUG_ON(addr >= end);
mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
@@ -1102,7 +1102,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ spinlock_t *i_mmap_lock = details->i_mmap_lock;
int fullmm = (*tlbp)->fullmm;
struct mm_struct *mm = vma->vm_mm;

@@ -1217,10 +1217,11 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size)
{
+ DEFINE_ZAP_DETAILS(details);
if (address < vma->vm_start || address + size > vma->vm_end ||
!(vma->vm_flags & VM_PFNMAP))
return -1;
- zap_page_range(vma, address, size, NULL);
+ zap_page_range(vma, address, size, &details);
return 0;
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);
@@ -2577,7 +2578,8 @@ restart:
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows)
{
- struct zap_details details;
+ DEFINE_ZAP_DETAILS(details);
+
pgoff_t hba = holebegin >> PAGE_SHIFT;
pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

diff --git a/mm/mmap.c b/mm/mmap.c
index b179abb..31d2594 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1900,11 +1900,12 @@ static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
struct mmu_gather *tlb;
unsigned long nr_accounted = 0;
+ DEFINE_ZAP_DETAILS(details);

lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(&tlb, vma, start, end, &nr_accounted, &details);
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
@@ -2254,6 +2255,7 @@ void exit_mmap(struct mm_struct *mm)
struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
unsigned long end;
+ DEFINE_ZAP_DETAILS(details);

/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
@@ -2278,7 +2280,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, &details);
vm_unacct_memory(nr_accounted);

free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
--
1.7.0.4

2010-12-05 17:30:14

by Minchan Kim

Subject: [PATCH v4 4/7] Reclaim invalidated page ASAP

invalidate_mapping_pages is a very big hint to the reclaimer.
It means the user doesn't want to use the page any more. So, in
order to prevent working set page eviction, this patch moves such
a page to the tail of the inactive list by setting PG_reclaim.

Please remember that pages on the inactive list are part of the
working set just like the active list. If we don't move these pages
to the inactive list's tail, pages near the tail of the inactive
list can be evicted even though we have a big clue about which
pages are useless. That's totally bad.

Currently PG_readahead and PG_reclaim share the same page flag.
Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io
to prevent fast reclaiming of a readahead-marked page.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that a page is invalidated and dirty, it sets
PG_reclaim so the page is reclaimed asap. But when the dirty page
is later written back, clear_page_dirty_for_io clears PG_reclaim
unconditionally, which defeats this series' goal.

I think it's okay to clear the readahead marker when the page is
dirtied, rather than at writeback time. So this patch moves
ClearPageReclaim.

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>

Changelog since v3:
- move page which ends up writeback in pagevec on inactive's tail
- suggested by Johannes

Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.

Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 39 ++++++++++++++++++++++++++++++++++++---
2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ /*
+ * readahead/lru_deactivate_page could leave PG_readahead/PG_reclaim
+ * set due to a race with end_page_writeback.
+ * For readahead, if the page is written again, the flag will be
+ * reset, so there is no problem.
+ * For lru_deactivate_page, if the page is redirtied, the flag will
+ * be reset as well, so again no problem. But if the page is then
+ * reused by readahead, it will confuse readahead and make it
+ * restart the size ramp-up process. That is a trivial problem, though.
+ */
+ ClearPageReclaim(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
@@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 0fe98e7..0f23998 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -275,26 +275,59 @@ void add_page_to_unevictable_list(struct page *page)
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
+ *
+ * If the page isn't page_mapped and is dirty/writeback, the page
+ * can be reclaimed asap by using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. Others -> none
+ *
+ * In case 4, the page is moved to the inactive list's head because
+ * the VM expects flusher threads to write it out, which is much more
+ * effective than the single-page writeout from reclaim.
*/
static void lru_deactivate(struct page *page, struct zone *zone)
{
int lru, file;
+ bool active;

- if (!PageLRU(page) || !PageActive(page))
+ if (!PageLRU(page))
return;

/* Some processes are using the page */
if (page_mapped(page))
return;

+ active = PageActive(page);
+
file = page_is_file_cache(page);
lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ del_page_from_lru_list(zone, page, lru + active);
ClearPageActive(page);
ClearPageReferenced(page);
add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);

+ if (PageWriteback(page) || PageDirty(page)) {
+ /*
+ * PG_reclaim could be raced with end_page_writeback
+ * It can make readahead confusing. But race window
+ * is _really_ small and it's non-critical problem.
+ */
+ SetPageReclaim(page);
+ } else {
+ /*
+ * The page's writeback ended while it was in the pagevec;
+ * move the page to the tail of the inactive list.
+ */
+ list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
+ }
+
+ if (active)
+ __count_vm_event(PGDEACTIVATE);
update_page_reclaim_stat(zone, page, file, 0);
}

--
1.7.0.4

2010-12-06 00:09:51

by Kamezawa Hiroyuki

Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

On Mon, 6 Dec 2010 02:29:11 +0900
Minchan Kim <[email protected]> wrote:

> Golbal page reclaim moves reclaimalbe pages into inactive list
> to reclaim asap. This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-12-06 01:48:06

by Rik van Riel

Subject: Re: [PATCH v4 1/7] Fix checkpatch's report in swap.c

On 12/05/2010 12:29 PM, Minchan Kim wrote:
> checkpatch reports following problems.
> It's a very annoying. This patch fixes it.

> total: 1 errors, 8 warnings, 517 lines checked
>
> Signed-off-by: Minchan Kim<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-12-06 03:05:08

by Rik van Riel

Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

On 12/05/2010 12:29 PM, Minchan Kim wrote:
> Golbal page reclaim moves reclaimalbe pages into inactive list
> to reclaim asap. This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.

The patch is right, but the description is wrong.

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list. That way the VM will find those pages first
next time it needs to free memory.

> Cc: Balbir Singh<[email protected]>
> Cc: KAMEZAWA Hiroyuki<[email protected]>
> Cc: Rik van Riel<[email protected]>
> Cc: KOSAKI Motohiro<[email protected]>
> Signed-off-by: Minchan Kim<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-12-06 03:25:41

by Rik van Riel

Subject: Re: [PATCH v4 5/7] add profile information for invalidated page reclaim

On 12/05/2010 12:29 PM, Minchan Kim wrote:
> This patch adds profile information about invalidated page reclaim.
> It's just for profiling for test so it would be discard when the series
> are merged.
>
> Signed-off-by: Minchan Kim<[email protected]>
> Cc: Rik van Riel<[email protected]>
> Cc: KOSAKI Motohiro<[email protected]>
> Cc: Wu Fengguang<[email protected]>
> Cc: Johannes Weiner<[email protected]>
> Cc: Nick Piggin<[email protected]>
> Cc: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-12-06 03:26:29

by Rik van Riel

Subject: Re: [PATCH v4 6/7] Remove zap_details NULL dependency

On 12/05/2010 12:29 PM, Minchan Kim wrote:
> Some functions used zap_details depends on assumption that
> zap_details parameter should be NULLed if some fields are 0.
>
> This patch removes that dependency for next patch easy review/merge.
> It should not chanage behavior.
>
> Signed-off-by: Minchan Kim<[email protected]>
> Cc: Rik van Riel<[email protected]>
> Cc: KOSAKI Motohiro<[email protected]>
> Cc: Johannes Weiner<[email protected]>
> Cc: Nick Piggin<[email protected]>
> Cc: Mel Gorman<[email protected]>
> Cc: Wu Fengguang<[email protected]>
> Cc: Hugh Dickins<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-12-06 03:52:24

by Balbir Singh

Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

* MinChan Kim <[email protected]> [2010-12-06 02:29:11]:

> Golbal page reclaim moves reclaimalbe pages into inactive list

Some typos here and Rik already pointed out some other changes.

> to reclaim asap. This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> include/linux/memcontrol.h | 6 ++++++
> mm/memcontrol.c | 27 +++++++++++++++++++++++++++
> mm/swap.c | 3 ++-
> 3 files changed, 35 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 067115c..8317f5c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
> extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
> extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> @@ -207,6 +208,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
> return ;
> }
>
> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + return ;
> +}
> +
> static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
> {
> return ;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 729beb7..f9435be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -829,6 +829,33 @@ void mem_cgroup_del_lru(struct page *page)
> mem_cgroup_del_lru_list(page, page_lru(page));
> }
>
> +/*
> + * Writeback is about to end against a page which has been marked for immediate
> + * reclaim. If it still appears to be reclaimable, move it to the tail of the
> + * inactive list.
> + */
> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + struct mem_cgroup_per_zone *mz;
> + struct page_cgroup *pc;
> + enum lru_list lru = page_lru_base_type(page);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + /*
> + * Used bit is set without atomic ops but after smp_wmb().
> + * For making pc->mem_cgroup visible, insert smp_rmb() here.
> + */
> + smp_rmb();
> + /* unused or root page is not rotated. */
> + if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> + return;
> + mz = page_cgroup_zoneinfo(pc);
> + list_move_tail(&pc->lru, &mz->lists[lru]);
> +}
> +
> void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
> {
> struct mem_cgroup_per_zone *mz;
> diff --git a/mm/swap.c b/mm/swap.c
> index 1f36f6f..0fe98e7 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -122,8 +122,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
> }
> if (PageLRU(page) && !PageActive(page) &&
> !PageUnevictable(page)) {
> - int lru = page_lru_base_type(page);
> + enum lru_list lru = page_lru_base_type(page);
> list_move_tail(&page->lru, &zone->lru[lru].list);
> + mem_cgroup_rotate_reclaimable_page(page);
> pgmoved++;
> }
> }

Looks good, do you have any numbers, workloads that benefit? I agree
that keeping both global and memcg reclaim in sync is a good idea.

Acked-by: Balbir Singh <[email protected]>


--
Three Cheers,
Balbir

2010-12-06 14:53:55

by Mel Gorman

Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.
>
> Why I need the page to head, Dirty/Writeback page would be flushed
> sooner or later. It can prevent writeout of pageout which is less
> effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
>
> Andrew. Before applying this series, please drop below two patches.
> mm-deactivate-invalidated-pages.patch
> mm-deactivate-invalidated-pages-fix.patch
>
> Changelog since v3:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - add only dirty/writeback pages to deactive pagevec
>
> Changelog since v2:
> - mapped page leaves alone - suggested by Mel
> - pass part related PG_reclaim in next patch.
>
> Changelog since v1:
> - modify description
> - correct typo
> - add some comment
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/swap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 17 ++++++++--
> 3 files changed, 92 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..605ab62 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> extern int lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> +extern void deactivate_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/swap.c b/mm/swap.c
> index d5822b0..1f36f6f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,8 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> +
>
> /*
> * This path almost never happens for VM activity - pages are normally

Unnecessary whitespace change there but otherwise;

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-12-07 00:17:22

by Minchan Kim

Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

On Mon, Dec 6, 2010 at 12:04 PM, Rik van Riel <[email protected]> wrote:
> On 12/05/2010 12:29 PM, Minchan Kim wrote:
>>
>> Golbal page reclaim moves reclaimalbe pages into inactive list
>> to reclaim asap. This patch apply the rule in memcg.
>> It can help to prevent unnecessary working page eviction of memcg.
>
> The patch is right, but the description is wrong.
>
> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list. That way the VM will find those pages first
> next time it needs to free memory.

Will fix.

>
>> Cc: Balbir Singh<[email protected]>
>> Cc: KAMEZAWA Hiroyuki<[email protected]>
>> Cc: Rik van Riel<[email protected]>
>> Cc: KOSAKI Motohiro<[email protected]>
>> Signed-off-by: Minchan Kim<[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>

Thanks, Rik.

>
> --
> All rights reversed
>



--
Kind regards,
Minchan Kim

2010-12-07 00:21:01

by Minchan Kim

Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

Hi Balbir,

On Mon, Dec 6, 2010 at 12:34 PM, Balbir Singh <[email protected]> wrote:
> * MinChan Kim <[email protected]> [2010-12-06 02:29:11]:
>
>> Golbal page reclaim moves reclaimalbe pages into inactive list
>
> Some typos here and Rik already pointed out some other changes.
>
>> to reclaim asap. This patch apply the rule in memcg.
>> It can help to prevent unnecessary working page eviction of memcg.
>>
>> Cc: Balbir Singh <[email protected]>
>> Cc: KAMEZAWA Hiroyuki <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>> include/linux/memcontrol.h | 6 ++++++
>> mm/memcontrol.c | 27 +++++++++++++++++++++++++++
>> mm/swap.c | 3 ++-
>> 3 files changed, 35 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 067115c..8317f5c 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>> gfp_t gfp_mask);
>> extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
>> extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
>> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
>> extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
>> extern void mem_cgroup_del_lru(struct page *page);
>> extern void mem_cgroup_move_lists(struct page *page,
>> @@ -207,6 +208,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
>> return ;
>> }
>>
>> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> + return ;
>> +}
>> +
>> static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
>> {
>> return ;
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 729beb7..f9435be 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -829,6 +829,33 @@ void mem_cgroup_del_lru(struct page *page)
>> mem_cgroup_del_lru_list(page, page_lru(page));
>> }
>>
>> +/*
>> + * Writeback is about to end against a page which has been marked for immediate
>> + * reclaim. If it still appears to be reclaimable, move it to the tail of the
>> + * inactive list.
>> + */
>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> + struct mem_cgroup_per_zone *mz;
>> + struct page_cgroup *pc;
>> + enum lru_list lru = page_lru_base_type(page);
>> +
>> + if (mem_cgroup_disabled())
>> + return;
>> +
>> + pc = lookup_page_cgroup(page);
>> + /*
>> + * Used bit is set without atomic ops but after smp_wmb().
>> + * For making pc->mem_cgroup visible, insert smp_rmb() here.
>> + */
>> + smp_rmb();
>> + /* unused or root page is not rotated. */
>> + if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>> + return;
>> + mz = page_cgroup_zoneinfo(pc);
>> + list_move_tail(&pc->lru, &mz->lists[lru]);
>> +}
>> +
>> void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
>> {
>> struct mem_cgroup_per_zone *mz;
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 1f36f6f..0fe98e7 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -122,8 +122,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
>> }
>> if (PageLRU(page) && !PageActive(page) &&
>> !PageUnevictable(page)) {
>> - int lru = page_lru_base_type(page);
>> + enum lru_list lru = page_lru_base_type(page);
>> list_move_tail(&page->lru, &zone->lru[lru].list);
>> + mem_cgroup_rotate_reclaimable_page(page);
>> pgmoved++;
>> }
>> }
>
> Looks good, do you have any numbers, workloads that benefit? I agree
> that keeping both global and memcg reclaim in sync is a good idea.

This patch series is for Ben's fadvise problem in rsync.
While fixing the global reclaim, I found this patch could help memcg, too.
If Ben is busy, I will measure the benefit.

>
> Acked-by: Balbir Singh <[email protected]>

Thanks, Balbir.

>
>
> --
> Three Cheers,
> Balbir
>



--
Kind regards,
Minchan Kim

2010-12-07 04:27:04

by Hugh Dickins

Subject: Re: [PATCH v4 6/7] Remove zap_details NULL dependency

On Mon, 6 Dec 2010, Minchan Kim wrote:

> Some functions used zap_details depends on assumption that
> zap_details parameter should be NULLed if some fields are 0.
>
> This patch removes that dependency for next patch easy review/merge.
> It should not chanage behavior.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Hugh Dickins <[email protected]>

Sorry, while I do like that you're now using the details block,
you seem to be adding overhead in various places without actually
simplifying anything - you insist that everything passes down an
initialized details block, and then in the end force the pointer
to NULL again in all the common cases.

Which seems odd. I could understand if you were going to scrap
the NULL details optimization altogether; but I think that (for
the original optimization reasons) you're right to force it to NULL
in the end, so then why initialize the block at all those call sites?

> ---
> include/linux/mm.h | 8 ++++++++
> mm/madvise.c | 15 +++++++++------
> mm/memory.c | 14 ++++++++------
> mm/mmap.c | 6 ++++--
> 4 files changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e097df6..6522ae4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -773,6 +773,14 @@ struct zap_details {
> unsigned long truncate_count; /* Compare vm_truncate_count */
> };
>
> +#define __ZAP_DETAILS_INITIALIZER(name) \
> + { .nonlinear_vma = NULL \
> + , .check_mapping = NULL \
> + , .i_mmap_lock = NULL }
> +
> +#define DEFINE_ZAP_DETAILS(name) \
> + struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)

Okay.

> +
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 319528b..bfa17aa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -162,18 +162,21 @@ static long madvise_dontneed(struct vm_area_struct * vma,
> struct vm_area_struct ** prev,
> unsigned long start, unsigned long end)
> {
> + DEFINE_ZAP_DETAILS(details);
> +
> *prev = vma;
> if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> return -EINVAL;
>
> if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
> - struct zap_details details = {
> - .nonlinear_vma = vma,
> - .last_index = ULONG_MAX,
> - };
> + details.nonlinear_vma = vma;
> + details.last_index = ULONG_MAX;
> +
> zap_page_range(vma, start, end - start, &details);
> - } else
> - zap_page_range(vma, start, end - start, NULL);
> + } else {
> +
> + zap_page_range(vma, start, end - start, &details);
> + }

You end up with two identical zap_page_range() lines:
better have one after the if {} without an else.

> return 0;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ebfeedf..c0879bb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -900,6 +900,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>
> init_rss_vec(rss);
>
> + if (!details->check_mapping && !details->nonlinear_vma)
> + details = NULL;
> +

Aside from its necessity in the next patch, I thoroughly approve of
your moving this optimization here: it is confusing, and better that
it be done near where the fields are used, than off at the higher level.

> pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> arch_enter_lazy_mmu_mode();
> do {
> @@ -1038,9 +1041,6 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
> pgd_t *pgd;
> unsigned long next;
>
> - if (details && !details->check_mapping && !details->nonlinear_vma)
> - details = NULL;
> -

Yes, I put it there because that was the highest point at which
it could then be done, so it was optimal from a do-it-fewest-times
point of view; but not at all helpful in understanding what's going
on, much better as you have it.

> BUG_ON(addr >= end);
> mem_cgroup_uncharge_start();
> tlb_start_vma(tlb, vma);
> @@ -1102,7 +1102,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
> unsigned long tlb_start = 0; /* For tlb_finish_mmu */
> int tlb_start_valid = 0;
> unsigned long start = start_addr;
> - spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> + spinlock_t *i_mmap_lock = details->i_mmap_lock;

This appears to be the sole improvement from insisting that everywhere
sets up an initialized details block. I don't think this is worth it.

> int fullmm = (*tlbp)->fullmm;
> struct mm_struct *mm = vma->vm_mm;
>
> @@ -1217,10 +1217,11 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
> int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> unsigned long size)
> {
> + DEFINE_ZAP_DETAILS(details);

Overhead.

> if (address < vma->vm_start || address + size > vma->vm_end ||
> !(vma->vm_flags & VM_PFNMAP))
> return -1;
> - zap_page_range(vma, address, size, NULL);
> + zap_page_range(vma, address, size, &details);
> return 0;
> }
> EXPORT_SYMBOL_GPL(zap_vma_ptes);
> @@ -2577,7 +2578,8 @@ restart:
> void unmap_mapping_range(struct address_space *mapping,
> loff_t const holebegin, loff_t const holelen, int even_cows)
> {
> - struct zap_details details;
> + DEFINE_ZAP_DETAILS(details);
> +
> pgoff_t hba = holebegin >> PAGE_SHIFT;
> pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b179abb..31d2594 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1900,11 +1900,12 @@ static void unmap_region(struct mm_struct *mm,
> struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
> struct mmu_gather *tlb;
> unsigned long nr_accounted = 0;
> + DEFINE_ZAP_DETAILS(details);

Overhead.

>
> lru_add_drain();
> tlb = tlb_gather_mmu(mm, 0);
> update_hiwater_rss(mm);
> - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
> + unmap_vmas(&tlb, vma, start, end, &nr_accounted, &details);
> vm_unacct_memory(nr_accounted);
> free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
> next? next->vm_start: 0);
> @@ -2254,6 +2255,7 @@ void exit_mmap(struct mm_struct *mm)
> struct vm_area_struct *vma;
> unsigned long nr_accounted = 0;
> unsigned long end;
> + DEFINE_ZAP_DETAILS(details);

Overhead.

>
> /* mm's last user has gone, and its about to be pulled down */
> mmu_notifier_release(mm);
> @@ -2278,7 +2280,7 @@ void exit_mmap(struct mm_struct *mm)
> tlb = tlb_gather_mmu(mm, 1);
> /* update_hiwater_rss(mm) here? but nobody should be looking */
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
> + end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, &details);
> vm_unacct_memory(nr_accounted);
>
> free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
> --

Am I being too fussy?

Hugh

2010-12-07 04:48:27

by Hugh Dickins

Subject: Re: [PATCH v4 7/7] Prevent activation of page in madvise_dontneed

On Mon, 6 Dec 2010, Minchan Kim wrote:

> Now zap_pte_range alwayas activates pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Hugh Dickins <[email protected]>
>
> Changelog since v3:
> - Change variable name - suggested by Johannes
> - Union ignore_references with zap_details - suggested by Hugh
>
> Changelog since v2:
> - remove unnecessary description
>
> Changelog since v1:
> - change word from promote to activate
> - add activate argument to zap_pte_range and family function
> ---
> include/linux/mm.h | 4 +++-
> mm/madvise.c | 6 +++---
> mm/memory.c | 5 ++++-
> 3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6522ae4..e57190f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -771,12 +771,14 @@ struct zap_details {
> pgoff_t last_index; /* Highest page->index to unmap */
> spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
> unsigned long truncate_count; /* Compare vm_truncate_count */
> + bool ignore_references; /* For page activation */
> };
>
> #define __ZAP_DETAILS_INITIALIZER(name) \
> { .nonlinear_vma = NULL \
> , .check_mapping = NULL \
> - , .i_mmap_lock = NULL }
> + , .i_mmap_lock = NULL \
> + , .ignore_references = false }

Okay.

>
> #define DEFINE_ZAP_DETAILS(name) \
> struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bfa17aa..8e7aba3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -163,6 +163,7 @@ static long madvise_dontneed(struct vm_area_struct * vma,
> unsigned long start, unsigned long end)
> {
> DEFINE_ZAP_DETAILS(details);
> + details.ignore_references = true;
>
> *prev = vma;
> if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> @@ -173,10 +174,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
> details.last_index = ULONG_MAX;
>
> zap_page_range(vma, start, end - start, &details);
> - } else {
> -
> + } else
> zap_page_range(vma, start, end - start, &details);
> - }
> +

As in the previous patch, you have the same in the if {} and the else.

> return 0;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c0879bb..44d87e1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -897,6 +897,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t *pte;
> spinlock_t *ptl;
> int rss[NR_MM_COUNTERS];
> + bool ignore_references = details->ignore_references;
>
> init_rss_vec(rss);
>
> @@ -952,7 +953,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (pte_dirty(ptent))
> set_page_dirty(page);
> if (pte_young(ptent) &&
> - likely(!VM_SequentialReadHint(vma)))
> + likely(!VM_SequentialReadHint(vma)) &&
> + !ignore_references)

I think ignore_references is about as likely as VM_SequentialReadHint:
I'd probably just omit that "likely()" nowadays, but you might prefer
to put your "|| !ignore_references" inside.

Hmm, actually it would probably be better to say something like

mark_accessed = true;
if (VM_SequentialReadHint(vma) ||
(details && details->ignore_references))
mark_accessed = false;

on entry to zap_pte_range().

> mark_page_accessed(page);
> rss[MM_FILEPAGES]--;
> }
> @@ -1218,6 +1220,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> unsigned long size)
> {
> DEFINE_ZAP_DETAILS(details);
> + details.ignore_references = true;
> if (address < vma->vm_start || address + size > vma->vm_end ||
> !(vma->vm_flags & VM_PFNMAP))
> return -1;
> --

Unnecessary here (would make more sense in the truncation case,
but not necessary there either): zap_vma_ptes() is only being used on
GRU's un-cowable VM_PFNMAP area, so vm_normal_page() won't even give
you a non-NULL page to mark.

Hugh

2010-12-07 05:30:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] Remove zap_details NULL dependency

Hi Hugh,

On Tue, Dec 7, 2010 at 1:26 PM, Hugh Dickins <[email protected]> wrote:
> On Mon, 6 Dec 2010, Minchan Kim wrote:
>
>> Some functions that use zap_details depend on the assumption that the
>> zap_details parameter is NULLed when certain fields are 0.
>>
>> This patch removes that dependency to make the next patch easier to review/merge.
>> It should not change behavior.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>
> Sorry, while I do like that you're now using the details block,
> you seem to be adding overhead in various places without actually
> simplifying anything - you insist that everything passes down an
> initialized details block, and then in the end force the pointer
> to NULL again in all the common cases.
>
> Which seems odd. I could understand if you were going to scrap
> the NULL details optimization altogether; but I think that (for
> the original optimization reasons) you're right to force it to NULL
> in the end, so then why initialize the block at all those call sites?
>
>> ---
>>  include/linux/mm.h |    8 ++++++++
>>  mm/madvise.c       |   15 +++++++++------
>>  mm/memory.c        |   14 ++++++++------
>>  mm/mmap.c          |    6 ++++--
>>  4 files changed, 29 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e097df6..6522ae4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -773,6 +773,14 @@ struct zap_details {
>>  	unsigned long truncate_count;		/* Compare vm_truncate_count */
>>  };
>>
>> +#define __ZAP_DETAILS_INITIALIZER(name) \
>> +		{ .nonlinear_vma = NULL \
>> +		, .check_mapping = NULL \
>> +		, .i_mmap_lock = NULL }
>> +
>> +#define DEFINE_ZAP_DETAILS(name)	\
>> +	struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)
>
> Okay.
>
>> +
>>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> 		pte_t pte);
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 319528b..bfa17aa 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -162,18 +162,21 @@ static long madvise_dontneed(struct vm_area_struct * vma,
>> 			   struct vm_area_struct ** prev,
>> 			   unsigned long start, unsigned long end)
>>  {
>> +	DEFINE_ZAP_DETAILS(details);
>> +
>>  	*prev = vma;
>>  	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>>  		return -EINVAL;
>>
>>  	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
>> -		struct zap_details details = {
>> -			.nonlinear_vma = vma,
>> -			.last_index = ULONG_MAX,
>> -		};
>> +		details.nonlinear_vma = vma;
>> +		details.last_index = ULONG_MAX;
>> +
>>  		zap_page_range(vma, start, end - start, &details);
>> -	} else
>> -		zap_page_range(vma, start, end - start, NULL);
>> +	} else {
>> +
>> +		zap_page_range(vma, start, end - start, &details);
>> +	}
>
> You end up with two identical zap_page_range() lines:
> better have one after the if {} without an else.
>

Okay. Will fix.

>> ? ? ? return 0;
>> ?}
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ebfeedf..c0879bb 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -900,6 +900,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>
>>  	init_rss_vec(rss);
>>
>> +	if (!details->check_mapping && !details->nonlinear_vma)
>> +		details = NULL;
>> +
>
> Aside from its necessity in the next patch, I thoroughly approve of
> your moving this optimization here: it is confusing, and better that
> it be done near where the fields are used, than off at the higher level.

Thanks.

>
>> ? ? ? pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> ? ? ? arch_enter_lazy_mmu_mode();
>> ? ? ? do {
>> @@ -1038,9 +1041,6 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
>> ? ? ? pgd_t *pgd;
>> ? ? ? unsigned long next;
>>
>> - ? ? if (details && !details->check_mapping && !details->nonlinear_vma)
>> - ? ? ? ? ? ? details = NULL;
>> -
>
> Yes, I put it there because that was the highest point at which
> it could then be done, so it was optimal from a do-it-fewest-times
> point of view; but not at all helpful in understanding what's going
> on, much better as you have it.
>
>> ? ? ? BUG_ON(addr >= end);
>> ? ? ? mem_cgroup_uncharge_start();
>> ? ? ? tlb_start_vma(tlb, vma);
>> @@ -1102,7 +1102,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
>>  	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
>>  	int tlb_start_valid = 0;
>>  	unsigned long start = start_addr;
>> -	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
>> +	spinlock_t *i_mmap_lock = details->i_mmap_lock;
>
> This appears to be the sole improvement from insisting that everywhere
> sets up an initialized details block. I don't think this is worth it.
>
>> ? ? ? int fullmm = (*tlbp)->fullmm;
>> ? ? ? struct mm_struct *mm = vma->vm_mm;
>>
>> @@ -1217,10 +1217,11 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
>> ?int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>> ? ? ? ? ? ? ? unsigned long size)
>> ?{
>> + ? ? DEFINE_ZAP_DETAILS(details);
>
> Overhead.
>
>> ? ? ? if (address < vma->vm_start || address + size > vma->vm_end ||
>> ? ? ? ? ? ? ? ? ? ? ? !(vma->vm_flags & VM_PFNMAP))
>> ? ? ? ? ? ? ? return -1;
>> - ? ? zap_page_range(vma, address, size, NULL);
>> + ? ? zap_page_range(vma, address, size, &details);
>> ? ? ? return 0;
>> ?}
>> ?EXPORT_SYMBOL_GPL(zap_vma_ptes);
>> @@ -2577,7 +2578,8 @@ restart:
>> ?void unmap_mapping_range(struct address_space *mapping,
>> ? ? ? ? ? ? ? loff_t const holebegin, loff_t const holelen, int even_cows)
>> ?{
>> - ? ? struct zap_details details;
>> + ? ? DEFINE_ZAP_DETAILS(details);
>> +
>> ? ? ? pgoff_t hba = holebegin >> PAGE_SHIFT;
>> ? ? ? pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b179abb..31d2594 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1900,11 +1900,12 @@ static void unmap_region(struct mm_struct *mm,
>> ? ? ? struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
>> ? ? ? struct mmu_gather *tlb;
>> ? ? ? unsigned long nr_accounted = 0;
>> + ? ? DEFINE_ZAP_DETAILS(details);
>
> Overhead.
>
>>
>> ? ? ? lru_add_drain();
>> ? ? ? tlb = tlb_gather_mmu(mm, 0);
>> ? ? ? update_hiwater_rss(mm);
>> - ? ? unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
>> + ? ? unmap_vmas(&tlb, vma, start, end, &nr_accounted, &details);
>> ? ? ? vm_unacct_memory(nr_accounted);
>> ? ? ? free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?next? next->vm_start: 0);
>> @@ -2254,6 +2255,7 @@ void exit_mmap(struct mm_struct *mm)
>> ? ? ? struct vm_area_struct *vma;
>> ? ? ? unsigned long nr_accounted = 0;
>> ? ? ? unsigned long end;
>> + ? ? DEFINE_ZAP_DETAILS(details);
>
> Overhead.
>
>>
>> ? ? ? /* mm's last user has gone, and its about to be pulled down */
>> ? ? ? mmu_notifier_release(mm);
>> @@ -2278,7 +2280,7 @@ void exit_mmap(struct mm_struct *mm)
>> ? ? ? tlb = tlb_gather_mmu(mm, 1);
>> ? ? ? /* update_hiwater_rss(mm) here? but nobody should be looking */
>> ? ? ? /* Use -1 here to ensure all VMAs in the mm are unmapped */
>> - ? ? end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
>> + ? ? end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, &details);
>> ? ? ? vm_unacct_memory(nr_accounted);
>>
>> ? ? ? free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
>> --
>
> Am I being too fussy?

Never. It's a good comment.
I don't want to add unnecessary overhead.

Will fix and resend.

Thanks for the review, Hugh.

>
> Hugh
>



--
Kind regards,
Minchan Kim

2010-12-07 05:44:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] Prevent activation of page in madvise_dontneed

On Tue, Dec 7, 2010 at 1:48 PM, Hugh Dickins <[email protected]> wrote:
> On Mon, 6 Dec 2010, Minchan Kim wrote:
>
>> Now zap_pte_range always activates pages which are pte_young &&
>> !VM_SequentialReadHint(vma). But when it is called for MADV_DONTNEED,
>> that is unnecessary since the page won't be used any more.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Reviewed-by: KOSAKI Motohiro <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>>
>> Changelog since v3:
>> ?- Change variable name - suggested by Johannes
>> ?- Union ignore_references with zap_details - suggested by Hugh
>>
>> Changelog since v2:
>> ?- remove unnecessary description
>>
>> Changelog since v1:
>> ?- change word from promote to activate
>> ?- add activate argument to zap_pte_range and family function
>> ---
>> ?include/linux/mm.h | ? ?4 +++-
>> ?mm/madvise.c ? ? ? | ? ?6 +++---
>> ?mm/memory.c ? ? ? ?| ? ?5 ++++-
>> ?3 files changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6522ae4..e57190f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -771,12 +771,14 @@ struct zap_details {
>> ? ? ? pgoff_t last_index; ? ? ? ? ? ? ? ? ? ? /* Highest page->index to unmap */
>> ? ? ? spinlock_t *i_mmap_lock; ? ? ? ? ? ? ? ?/* For unmap_mapping_range: */
>> ? ? ? unsigned long truncate_count; ? ? ? ? ? /* Compare vm_truncate_count */
>> + ? ? bool ignore_references; ? ? ? ? ? ? ? ? /* For page activation */
>> ?};
>>
>> ?#define __ZAP_DETAILS_INITIALIZER(name) \
>> ? ? ? ? ? ? ? ? ?{ .nonlinear_vma = NULL \
>> ? ? ? ? ? ? ? , .check_mapping = NULL \
>> - ? ? ? ? ? ? , .i_mmap_lock = NULL }
>> + ? ? ? ? ? ? , .i_mmap_lock = NULL ? \
>> + ? ? ? ? ? ? , .ignore_references = false }
>
> Okay.
>
>>
>> ?#define DEFINE_ZAP_DETAILS(name) ? ? ? ? ? ? \
>> ? ? ? struct zap_details name = __ZAP_DETAILS_INITIALIZER(name)
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index bfa17aa..8e7aba3 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -163,6 +163,7 @@ static long madvise_dontneed(struct vm_area_struct * vma,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long start, unsigned long end)
>> ?{
>> ? ? ? DEFINE_ZAP_DETAILS(details);
>> + ? ? details.ignore_references = true;
>>
>> ? ? ? *prev = vma;
>> ? ? ? if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>> @@ -173,10 +174,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
>> ? ? ? ? ? ? ? details.last_index = ULONG_MAX;
>>
>> ? ? ? ? ? ? ? zap_page_range(vma, start, end - start, &details);
>> - ? ? } else {
>> -
>> + ? ? } else
>> ? ? ? ? ? ? ? zap_page_range(vma, start, end - start, &details);
>> - ? ? }
>> +
>
> As in the previous patch, you have the same in the if {} and the else.
>
>> ? ? ? return 0;
>> ?}
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c0879bb..44d87e1 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -897,6 +897,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> ? ? ? pte_t *pte;
>> ? ? ? spinlock_t *ptl;
>> ? ? ? int rss[NR_MM_COUNTERS];
>> + ? ? bool ignore_references = details->ignore_references;
>>
>> ? ? ? init_rss_vec(rss);
>>
>> @@ -952,7 +953,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (pte_dirty(ptent))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? set_page_dirty(page);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (pte_young(ptent) &&
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? likely(!VM_SequentialReadHint(vma)))
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? likely(!VM_SequentialReadHint(vma)) &&
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? !ignore_references)
>
> I think ignore_references is about as likely as VM_SequentialReadHint:
> I'd probably just omit that "likely()" nowadays, but you might prefer
> to put your "|| !ignore_references" inside.
>
> Hmm, actually it would probably be better to say something like
>
>        mark_accessed = true;
>        if (VM_SequentialReadHint(vma) ||
>            (details && details->ignore_references))
>                mark_accessed = false;
>
> on entry to zap_pte_range().
>
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mark_page_accessed(page);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? rss[MM_FILEPAGES]--;
>> ? ? ? ? ? ? ? ? ? ? ? }
>> @@ -1218,6 +1220,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>> ? ? ? ? ? ? ? unsigned long size)
>> ?{
>> ? ? ? DEFINE_ZAP_DETAILS(details);
>> + ? ? details.ignore_references = true;
>> ? ? ? if (address < vma->vm_start || address + size > vma->vm_end ||
>> ? ? ? ? ? ? ? ? ? ? ? !(vma->vm_flags & VM_PFNMAP))
>> ? ? ? ? ? ? ? return -1;
>> --
>
> Unnecessary here (would make more sense in the truncation case,
> but not necessary there either): zap_vma_ptes() is only being used on
> GRU's un-cowable VM_PFNMAP area, so vm_normal_page() won't even give
> you a non-NULL page to mark.

Thanks for the notice.

How about this? Although it doesn't remove the NULL dependency, it meets my
goal without big overhead.
It's just a quick patch. If you agree, I will resend this version as a formal patch.
(If you suffer from the word-wrapped source below, see the attachment.
I have asked Google twice to support a text/plain mode in the Gmail web
client but haven't received any response so far. ;( Lots of kernel
developers work at Google. Please support this mode for those of us who
can't use SMTP, even though it's a very small VOC.)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e097df6..14ae918 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -771,6 +771,7 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index
to unmap */
spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
+ int ignore_reference;
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..fdb0253 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -162,18 +162,22 @@ static long madvise_dontneed(struct vm_area_struct * vma,
struct vm_area_struct ** prev,
unsigned long start, unsigned long end)
{
+ struct zap_details details ;
+
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;

if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
- struct zap_details details = {
- .nonlinear_vma = vma,
- .last_index = ULONG_MAX,
- };
- zap_page_range(vma, start, end - start, &details);
- } else
- zap_page_range(vma, start, end - start, NULL);
+ details.nonlinear_vma = vma;
+ details.last_index = ULONG_MAX;
+ } else {
+ details.nonlinear_vma = NULL;
+ details.last_index = NULL;
+ }
+
+ details.ignore_references = true;
+ zap_page_range(vma, start, end - start, &details);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index ebfeedf..d46ac42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -897,9 +897,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *pte;
spinlock_t *ptl;
int rss[NR_MM_COUNTERS];
-
+ bool ignore_reference = false;
init_rss_vec(rss);

+ if (details && ((!details->check_mapping && !details->nonlinear_vma)
+ || !details->ignore_reference))
+ details = NULL;
+
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
do {
@@ -949,7 +955,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
- likely(!VM_SequentialReadHint(vma)))
+ likely(!VM_SequentialReadHint(vma)) &&
+ likely(!ignore_reference))
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
@@ -1038,8 +1045,6 @@ static unsigned long unmap_page_range(struct
mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;

- if (details && !details->check_mapping && !details->nonlinear_vma)
- details = NULL;

BUG_ON(addr >= end);
mem_cgroup_uncharge_start();
@@ -1102,7 +1107,8 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ spinlock_t *i_mmap_lock = details ?
+ (detais->check_mapping ? details->i_mmap_lock: NULL) : NULL;
int fullmm = (*tlbp)->fullmm;
struct mm_struct *mm = vma->vm_mm;



>
> Hugh
>



--
Kind regards,
Minchan Kim


Attachments:
madvise.patch (2.88 kB)

2010-12-07 14:37:56

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] Fix checkpatch's report in swap.c

On Mon, Dec 06, 2010 at 02:29:09AM +0900, Minchan Kim wrote:
> checkpatch reports following problems.
> It's a very annoying. This patch fixes it.
>
> barrios@barrios-desktop:~/linux-2.6$ ./scripts/checkpatch.pl -f mm/swap.c
> WARNING: line over 80 characters
> + if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>
> WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> +EXPORT_SYMBOL(mark_page_accessed);
>
> ERROR: code indent should use tabs where possible
> + ^I^I}$
>
> WARNING: please, no space before tabs
> + ^I^I}$
>
> WARNING: please, no spaces at the start of a line
> + ^I^I}$
>
> WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> +EXPORT_SYMBOL(__pagevec_release);
>
> WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> +EXPORT_SYMBOL(____pagevec_lru_add);
>
> WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> +EXPORT_SYMBOL(pagevec_lookup);
>
> WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> +EXPORT_SYMBOL(pagevec_lookup_tag);
>
> total: 1 errors, 8 warnings, 517 lines checked
>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2010-12-07 14:49:53

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> Changelog since v3:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - add only dirty/writeback pages to deactive pagevec

Why the extra check?

> @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + /*
> + * If the page is dirty or under writeback, we can not
> + * invalidate it now. But we assume that attempted
> + * invalidation is a hint that the page is no longer
> + * of interest and try to speed up its reclaim.
> + */
> + if (!ret && (PageDirty(page) || PageWriteback(page)))
> + deactivate_page(page);

The writeback completion handler does not take the page lock, so you
can still miss pages that finish writeback before this test, no?

Can you explain why you felt the need to add these checks?

Thanks!

Hannes

2010-12-07 14:52:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

On Mon, Dec 06, 2010 at 02:29:11AM +0900, Minchan Kim wrote:
> Global page reclaim moves reclaimable pages into the inactive list
> to reclaim them asap. This patch applies the same rule to memcg.
> It can help prevent unnecessary eviction of a memcg's working-set pages.
>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2010-12-07 15:05:44

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Mon, Dec 06, 2010 at 02:29:12AM +0900, Minchan Kim wrote:
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -275,26 +275,59 @@ void add_page_to_unevictable_list(struct page *page)
> * head of the list, rather than the tail, to give the flusher
> * threads some time to write it out, as this is much more
> * effective than the single-page writeout from reclaim.
> + *
> + * If the page isn't page_mapped and dirty/writeback, the page
> + * could reclaim asap using PG_reclaim.
> + *
> + * 1. active, mapped page -> none
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim

inactive, clean -> inactive, tail

> + * 5. Others -> none
> + *
> + * In 4, why it moves inactive's head, the VM expects the page would
> + * be write it out by flusher threads as this is much more effective
> + * than the single-page writeout from reclaim.
> */
> static void lru_deactivate(struct page *page, struct zone *zone)
> {
> int lru, file;
> + bool active;
>
> - if (!PageLRU(page) || !PageActive(page))
> + if (!PageLRU(page))
> return;
>
> /* Some processes are using the page */
> if (page_mapped(page))
> return;
>
> + active = PageActive(page);
> +
> file = page_is_file_cache(page);
> lru = page_lru_base_type(page);
> - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> + del_page_from_lru_list(zone, page, lru + active);
> ClearPageActive(page);
> ClearPageReferenced(page);
> add_page_to_lru_list(zone, page, lru);
> - __count_vm_event(PGDEACTIVATE);
>
> + if (PageWriteback(page) || PageDirty(page)) {
> + /*
> + * PG_reclaim could be raced with end_page_writeback
> + * It can make readahead confusing. But race window
> + * is _really_ small and it's non-critical problem.
> + */
> + SetPageReclaim(page);
> + } else {
> + /*
> + * The page's writeback ends up during pagevec
> + * We moves tha page into tail of inactive.
> + */
> + list_move_tail(&page->lru, &zone->lru[lru].list);
> + mem_cgroup_rotate_reclaimable_page(page);

I think you also need to increase PGROTATED here.
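
Roughly like this in the else branch -- an untested sketch on top of your
patch, mirroring the PGROTATED accounting done elsewhere for rotations:

		} else {
			/*
			 * Writeback has already completed: rotate the page
			 * to the tail of the inactive list so reclaim finds
			 * it first, and account the rotation.
			 */
			list_move_tail(&page->lru, &zone->lru[lru].list);
			mem_cgroup_rotate_reclaimable_page(page);
			__count_vm_event(PGROTATED);
		}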

Other than that,
Reviewed-by: Johannes Weiner <[email protected]>

2010-12-07 15:07:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
> On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> > Changelog since v3:
> > - Change function comments - suggested by Johannes
> > - Change function name - suggested by Johannes
> > - add only dirty/writeback pages to deactive pagevec
>
> Why the extra check?
>
> > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > if (lock_failed)
> > continue;
> >
> > - ret += invalidate_inode_page(page);
> > -
> > + ret = invalidate_inode_page(page);
> > + /*
> > + * If the page is dirty or under writeback, we can not
> > + * invalidate it now. But we assume that attempted
> > + * invalidation is a hint that the page is no longer
> > + * of interest and try to speed up its reclaim.
> > + */
> > + if (!ret && (PageDirty(page) || PageWriteback(page)))
> > + deactivate_page(page);
>
> The writeback completion handler does not take the page lock, so you
> can still miss pages that finish writeback before this test, no?

Yes, but I think it's rare and, even if it happens, it's not critical.
>
> Can you explain why you felt the need to add these checks?

invalidate_inode_page can return 0 even though the page is !{dirty|writeback}.
Look at invalidate_complete_page. As the easiest example, if the page has
buffers and try_to_release_page can't release them, it returns 0.

I want to check for this.
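
To spell out the path I mean -- roughly paraphrased from mm/truncate.c,
with the details elided, so not the exact code:

	int invalidate_inode_page(struct page *page)
	{
		...
		if (PageDirty(page) || PageWriteback(page))
			return 0;
		if (page_mapped(page))
			return 0;
		return invalidate_complete_page(mapping, page);
	}

	static int
	invalidate_complete_page(struct address_space *mapping, struct page *page)
	{
		...
		/* clean page, but its buffers are still pinned */
		if (page_has_private(page) && !try_to_release_page(page, 0))
			return 0;
		...
	}

So a return value of 0 doesn't necessarily mean dirty/writeback, which is
why I check the page flags explicitly before calling deactivate_page.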

>
> Thanks!
>
> Hannes

--
Kind regards,
Minchan Kim

2010-12-07 15:20:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
> On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
> > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> > > Changelog since v3:
> > > - Change function comments - suggested by Johannes
> > > - Change function name - suggested by Johannes
> > > - add only dirty/writeback pages to deactive pagevec
> >
> > Why the extra check?
> >
> > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > if (lock_failed)
> > > continue;
> > >
> > > - ret += invalidate_inode_page(page);
> > > -
> > > + ret = invalidate_inode_page(page);
> > > + /*
> > > + * If the page is dirty or under writeback, we can not
> > > + * invalidate it now. But we assume that attempted
> > > + * invalidation is a hint that the page is no longer
> > > + * of interest and try to speed up its reclaim.
> > > + */
> > > + if (!ret && (PageDirty(page) || PageWriteback(page)))
> > > + deactivate_page(page);
> >
> > The writeback completion handler does not take the page lock, so you
> > can still miss pages that finish writeback before this test, no?
>
> Yes. but I think it's rare and even though it happens, it's not critical.
> >
> > Can you explain why you felt the need to add these checks?
>
> invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
> Look invalidate_complete_page. As easiest example, if the page has buffer and
> try_to_release_page can't release the buffer, it could return 0.

Ok, but somebody still tried to truncate the page, so why shouldn't we
try to reclaim it? The reason for deactivating at this location is
that truncation is a strong hint for reclaim, not that it failed due
to dirty/writeback pages.

What's the problem with deactivating pages where try_to_release_page()
failed?

I don't think we should add more logic than necessary. If there is a
good reason for it, it needs to get a code comment at least.

2010-12-07 15:22:00

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Tue, Dec 07, 2010 at 04:05:25PM +0100, Johannes Weiner wrote:
> On Mon, Dec 06, 2010 at 02:29:12AM +0900, Minchan Kim wrote:
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -275,26 +275,59 @@ void add_page_to_unevictable_list(struct page *page)
> > * head of the list, rather than the tail, to give the flusher
> > * threads some time to write it out, as this is much more
> > * effective than the single-page writeout from reclaim.
> > + *
> > + * If the page isn't page_mapped and dirty/writeback, the page
> > + * could reclaim asap using PG_reclaim.
> > + *
> > + * 1. active, mapped page -> none
> > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 3. inactive, mapped page -> none
> > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
>
> inactive, clean -> inactive, tail

Indeed. I missed it.

>
> > + * 5. Others -> none
> > + *
> > + * In 4, why it moves inactive's head, the VM expects the page would
> > + * be write it out by flusher threads as this is much more effective
> > + * than the single-page writeout from reclaim.
> > */
> > static void lru_deactivate(struct page *page, struct zone *zone)
> > {
> > int lru, file;
> > + bool active;
> >
> > - if (!PageLRU(page) || !PageActive(page))
> > + if (!PageLRU(page))
> > return;
> >
> > /* Some processes are using the page */
> > if (page_mapped(page))
> > return;
> >
> > + active = PageActive(page);
> > +
> > file = page_is_file_cache(page);
> > lru = page_lru_base_type(page);
> > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > + del_page_from_lru_list(zone, page, lru + active);
> > ClearPageActive(page);
> > ClearPageReferenced(page);
> > add_page_to_lru_list(zone, page, lru);
> > - __count_vm_event(PGDEACTIVATE);
> >
> > + if (PageWriteback(page) || PageDirty(page)) {
> > + /*
> > + * PG_reclaim could be raced with end_page_writeback
> > + * It can make readahead confusing. But race window
> > + * is _really_ small and it's non-critical problem.
> > + */
> > + SetPageReclaim(page);
> > + } else {
> > + /*
> > + * The page's writeback ends up during pagevec
> > + * We moves tha page into tail of inactive.
> > + */
> > + list_move_tail(&page->lru, &zone->lru[lru].list);
> > + mem_cgroup_rotate_reclaimable_page(page);
>
> I think you also need to increase PGROTATED here.

Absolutely.

Thanks, Hannes. :)
>
> Other than that,
> Reviewed-by: Johannes Weiner <[email protected]>

--
Kind regards,
Minchan Kim

2010-12-07 15:26:36

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Tue, Dec 07, 2010 at 04:19:39PM +0100, Johannes Weiner wrote:
> On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
> > On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
> > > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> > > > Changelog since v3:
> > > > - Change function comments - suggested by Johannes
> > > > - Change function name - suggested by Johannes
> > > > - add only dirty/writeback pages to deactive pagevec
> > >
> > > Why the extra check?
> > >
> > > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > > if (lock_failed)
> > > > continue;
> > > >
> > > > - ret += invalidate_inode_page(page);
> > > > -
> > > > + ret = invalidate_inode_page(page);
> > > > + /*
> > > > + * If the page is dirty or under writeback, we can not
> > > > + * invalidate it now. But we assume that attempted
> > > > + * invalidation is a hint that the page is no longer
> > > > + * of interest and try to speed up its reclaim.
> > > > + */
> > > > + if (!ret && (PageDirty(page) || PageWriteback(page)))
> > > > + deactivate_page(page);
> > >
> > > The writeback completion handler does not take the page lock, so you
> > > can still miss pages that finish writeback before this test, no?
> >
> > Yes. but I think it's rare and even though it happens, it's not critical.
> > >
> > > Can you explain why you felt the need to add these checks?
> >
> > invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
> > Look invalidate_complete_page. As easiest example, if the page has buffer and
> > try_to_release_page can't release the buffer, it could return 0.
>
> Ok, but somebody still tried to truncate the page, so why shouldn't we
> try to reclaim it? The reason for deactivating at this location is
> that truncation is a strong hint for reclaim, not that it failed due
> to dirty/writeback pages.
>
> What's the problem with deactivating pages where try_to_release_page()
> failed?

If try_to_release_page fails and such pages stay in the pagevec for a long time,
pagevec draining happens often. I think such pages are rare, so skipping them
doesn't hurt the goal of this patch.

>
> I don't think we should add more logic than necessary. If there is a
> good reason for it, it needs to get a code comment at least.

Is my comment above enough to justify it? If you agree, I can add the comment.

Thanks for careful review, Hannes.

--
Kind regards,
Minchan Kim

2010-12-07 15:57:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, Dec 08, 2010 at 12:26:25AM +0900, Minchan Kim wrote:
> On Tue, Dec 07, 2010 at 04:19:39PM +0100, Johannes Weiner wrote:
> > On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
> > > On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
> > > > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> > > > > Changelog since v3:
> > > > > - Change function comments - suggested by Johannes
> > > > > - Change function name - suggested by Johannes
> > > > > - add only dirty/writeback pages to deactive pagevec
> > > >
> > > > Why the extra check?
> > > >
> > > > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > > > if (lock_failed)
> > > > > continue;
> > > > >
> > > > > - ret += invalidate_inode_page(page);
> > > > > -
> > > > > + ret = invalidate_inode_page(page);
> > > > > + /*
> > > > > + * If the page is dirty or under writeback, we can not
> > > > > + * invalidate it now. But we assume that attempted
> > > > > + * invalidation is a hint that the page is no longer
> > > > > + * of interest and try to speed up its reclaim.
> > > > > + */
> > > > > + if (!ret && (PageDirty(page) || PageWriteback(page)))
> > > > > + deactivate_page(page);
> > > >
> > > > The writeback completion handler does not take the page lock, so you
> > > > can still miss pages that finish writeback before this test, no?
> > >
> > > Yes. but I think it's rare and even though it happens, it's not critical.
> > > >
> > > > Can you explain why you felt the need to add these checks?
> > >
> > > invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
> > > Look invalidate_complete_page. As easiest example, if the page has buffer and
> > > try_to_release_page can't release the buffer, it could return 0.
> >
> > Ok, but somebody still tried to truncate the page, so why shouldn't we
> > try to reclaim it? The reason for deactivating at this location is
> > that truncation is a strong hint for reclaim, not that it failed due
> > to dirty/writeback pages.
> >
> > What's the problem with deactivating pages where try_to_release_page()
> > failed?
>
> If try_to_release_page fails and the such pages stay long time in pagevec,
> pagevec drain often happens.

You mean because the pagevec becomes full more often? These are not
many pages you get extra without the checks, the race window is very
small after all.

> I think such pages are rare so skip such pages doesn't hurt goal of
> this patch.

Well, you add extra checks, extra detail to this mechanism. Instead
of just saying 'tried to truncate, failed, deactivate the page', you
add more ifs and buts.

There should be a real justification for it. 'It can not hurt' is not
a good justification for extra code and making a simple model more
complex.

'It will hurt without treating these pages differently' is a good
justification. Remember that we have to understand and maintain all
this. The less checks and operations we need to implement a certain
idea, the better.

Sorry for being so adamant about this, but I think these random checks
are a really sore point of mm code already.

[ For example, we tried discussing lumpy reclaim mode recently and
none of us could reliably remember how it actually behaved. There
are so many special conditions in there that we already end up with
some of them being dead code and the checks even contradicting each
other. ]

2010-12-07 22:51:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, Dec 8, 2010 at 12:56 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Dec 08, 2010 at 12:26:25AM +0900, Minchan Kim wrote:
>> On Tue, Dec 07, 2010 at 04:19:39PM +0100, Johannes Weiner wrote:
>> > On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
>> > > On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
>> > > > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
>> > > > > Changelog since v3:
>> > > > > ?- Change function comments - suggested by Johannes
>> > > > > ?- Change function name - suggested by Johannes
>> > > > > ?- add only dirty/writeback pages to deactive pagevec
>> > > >
>> > > > Why the extra check?
>> > > >
>> > > > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> > > > > ? ? ? ? ? ? ? ? ? ? ? if (lock_failed)
>> > > > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> > > > >
>> > > > > - ? ? ? ? ? ? ? ? ? ? ret += invalidate_inode_page(page);
>> > > > > -
>> > > > > + ? ? ? ? ? ? ? ? ? ? ret = invalidate_inode_page(page);
>> > > > > + ? ? ? ? ? ? ? ? ? ? /*
>> > > > > + ? ? ? ? ? ? ? ? ? ? ?* If the page is dirty or under writeback, we can not
>> > > > > + ? ? ? ? ? ? ? ? ? ? ?* invalidate it now. ?But we assume that attempted
>> > > > > + ? ? ? ? ? ? ? ? ? ? ?* invalidation is a hint that the page is no longer
>> > > > > + ? ? ? ? ? ? ? ? ? ? ?* of interest and try to speed up its reclaim.
>> > > > > + ? ? ? ? ? ? ? ? ? ? ?*/
>> > > > > + ? ? ? ? ? ? ? ? ? ? if (!ret && (PageDirty(page) || PageWriteback(page)))
>> > > > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? deactivate_page(page);
>> > > >
>> > > > The writeback completion handler does not take the page lock, so you
>> > > > can still miss pages that finish writeback before this test, no?
>> > >
>> > > Yes. but I think it's rare and even though it happens, it's not critical.
>> > > >
>> > > > Can you explain why you felt the need to add these checks?
>> > >
>> > > invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
>> > > Look invalidate_complete_page. As easiest example, if the page has buffer and
>> > > try_to_release_page can't release the buffer, it could return 0.
>> >
>> > Ok, but somebody still tried to truncate the page, so why shouldn't we
>> > try to reclaim it? ?The reason for deactivating at this location is
>> > that truncation is a strong hint for reclaim, not that it failed due
>> > to dirty/writeback pages.
>> >
>> > What's the problem with deactivating pages where try_to_release_page()
>> > failed?
>>
>> If try_to_release_page fails and the such pages stay long time in pagevec,
>> pagevec drain often happens.
>
> You mean because the pagevec becomes full more often? ?These are not
> many pages you get extra without the checks, the race window is very
> small after all.

Right.
It was a totally bad answer. Working at midnight hurts my mind. :)

Another point is that we would be moving such pages (!try_to_release_page,
someone else holding the ref) into the tail of the inactive list.
We can't expect such pages to be freed sooner or later, and it can
stir the LRU pages unnecessarily.
On the other hand, it's a _really_ rare case, so couldn't we move the pages into the tail?
If it can be justified, I will remove the check.
What do you think about it?

>
>> I think such pages are rare so skip such pages doesn't hurt goal of
>> this patch.
>
> Well, you add extra checks, extra detail to this mechanism. ?Instead
> of just saying 'tried to truncate, failed, deactivate the page', you
> add more ifs and buts.
>
> There should be a real justification for it. ?'It can not hurt' is not
> a good justification for extra code and making a simple model more
> complex.
>
> 'It will hurt without treating these pages differently' is a good
> justification. ?Remember that we have to understand and maintain all
> this. ?The less checks and operations we need to implement a certain
> idea, the better.
>
> Sorry for being so adamant about this, but I think these random checks
> are a really sore point of mm code already.

Never mind. I totally support your opinion.
It always makes reviewing the mm code confusing.
Nowadays, many reviewers want detailed comments and descriptions. Given
that mm code changes are getting bigger, I believe that's the way to go.

>
> [ For example, we tried discussing lumpy reclaim mode recently and
> ?none of us could reliably remember how it actually behaved. ?There
> ?are so many special conditions in there that we already end up with
> ?some of them being dead code and the checks even contradicting each
> ?other. ]
>

Thanks for good comment, Hannes.




--
Kind regards,
Minchan Kim

2010-12-08 01:26:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, 8 Dec 2010 07:51:25 +0900
Minchan Kim <[email protected]> wrote:

> On Wed, Dec 8, 2010 at 12:56 AM, Johannes Weiner <[email protected]> wrote:
> > On Wed, Dec 08, 2010 at 12:26:25AM +0900, Minchan Kim wrote:
> >> On Tue, Dec 07, 2010 at 04:19:39PM +0100, Johannes Weiner wrote:
> >> > On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
> >> > > On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
> >> > > > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
> >> > > > > Changelog since v3:
> >> > > > >  - Change function comments - suggested by Johannes
> >> > > > >  - Change function name - suggested by Johannes
> >> > > > >  - add only dirty/writeback pages to deactive pagevec
> >> > > >
> >> > > > Why the extra check?
> >> > > >
> >> > > > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> >> > > > >                       if (lock_failed)
> >> > > > >                               continue;
> >> > > > >
> >> > > > > -                     ret += invalidate_inode_page(page);
> >> > > > > -
> >> > > > > +                     ret = invalidate_inode_page(page);
> >> > > > > +                     /*
> >> > > > > +                      * If the page is dirty or under writeback, we can not
> >> > > > > +                      * invalidate it now.  But we assume that attempted
> >> > > > > +                      * invalidation is a hint that the page is no longer
> >> > > > > +                      * of interest and try to speed up its reclaim.
> >> > > > > +                      */
> >> > > > > +                     if (!ret && (PageDirty(page) || PageWriteback(page)))
> >> > > > > +                             deactivate_page(page);
> >> > > >
> >> > > > The writeback completion handler does not take the page lock, so you
> >> > > > can still miss pages that finish writeback before this test, no?
> >> > >
> >> > > Yes. but I think it's rare and even though it happens, it's not critical.
> >> > > >
> >> > > > Can you explain why you felt the need to add these checks?
> >> > >
> >> > > invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
> >> > > Look invalidate_complete_page. As easiest example, if the page has buffer and
> >> > > try_to_release_page can't release the buffer, it could return 0.
> >> >
> >> > Ok, but somebody still tried to truncate the page, so why shouldn't we
> >> > try to reclaim it?  The reason for deactivating at this location is
> >> > that truncation is a strong hint for reclaim, not that it failed due
> >> > to dirty/writeback pages.
> >> >
> >> > What's the problem with deactivating pages where try_to_release_page()
> >> > failed?
> >>
> >> If try_to_release_page fails and the such pages stay long time in pagevec,
> >> pagevec drain often happens.
> >
> > You mean because the pagevec becomes full more often?  These are not
> > many pages you get extra without the checks, the race window is very
> > small after all.
>
> Right.
> It was a totally bad answer. The work in midnight makes my mind to be hurt. :)
>
> Another point is that we can move such pages(!try_to_release_page,
> someone else holding the ref) into tail of inactive.
> We can't expect such pages will be freed sooner or later and it can
> stir lru pages unnecessary.
> On the other hand it's a _really_ rare so couldn't we move the pages into tail?
> If it can be justified, I will remove the check.
> What do you think about it?
>

I wonder... how about adding a "victim" list for "Reclaim" pages? Then we don't need
extra LRU rotation.

Thanks,
-Kame

2010-12-08 01:48:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

Hi Kame,

On Wed, Dec 8, 2010 at 9:56 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 8 Dec 2010 07:51:25 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Wed, Dec 8, 2010 at 12:56 AM, Johannes Weiner <[email protected]> wrote:
>> > On Wed, Dec 08, 2010 at 12:26:25AM +0900, Minchan Kim wrote:
>> >> On Tue, Dec 07, 2010 at 04:19:39PM +0100, Johannes Weiner wrote:
>> >> > On Wed, Dec 08, 2010 at 12:07:10AM +0900, Minchan Kim wrote:
>> >> > > On Tue, Dec 07, 2010 at 03:49:24PM +0100, Johannes Weiner wrote:
>> >> > > > On Mon, Dec 06, 2010 at 02:29:10AM +0900, Minchan Kim wrote:
>> >> > > > > Changelog since v3:
>> >> > > > > ?- Change function comments - suggested by Johannes
>> >> > > > > ?- Change function name - suggested by Johannes
>> >> > > > > ?- add only dirty/writeback pages to deactive pagevec
>> >> > > >
>> >> > > > Why the extra check?
>> >> > > >
>> >> > > > > @@ -359,8 +360,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> >> > > > > ? ? ? ? ? ? ? ? ? ? ? if (lock_failed)
>> >> > > > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> >> > > > >
>> >> > > > > - ? ? ? ? ? ? ? ? ? ? ret += invalidate_inode_page(page);
>> >> > > > > -
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ret = invalidate_inode_page(page);
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? /*
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ?* If the page is dirty or under writeback, we can not
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ?* invalidate it now. ?But we assume that attempted
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ?* invalidation is a hint that the page is no longer
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ?* of interest and try to speed up its reclaim.
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ?*/
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? if (!ret && (PageDirty(page) || PageWriteback(page)))
>> >> > > > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? deactivate_page(page);
>> >> > > >
>> >> > > > The writeback completion handler does not take the page lock, so you
>> >> > > > can still miss pages that finish writeback before this test, no?
>> >> > >
>> >> > > Yes. but I think it's rare and even though it happens, it's not critical.
>> >> > > >
>> >> > > > Can you explain why you felt the need to add these checks?
>> >> > >
>> >> > > invalidate_inode_page can return 0 although the pages is !{dirty|writeback}.
>> >> > > Look invalidate_complete_page. As easiest example, if the page has buffer and
>> >> > > try_to_release_page can't release the buffer, it could return 0.
>> >> >
>> >> > Ok, but somebody still tried to truncate the page, so why shouldn't we
>> >> > try to reclaim it? ?The reason for deactivating at this location is
>> >> > that truncation is a strong hint for reclaim, not that it failed due
>> >> > to dirty/writeback pages.
>> >> >
>> >> > What's the problem with deactivating pages where try_to_release_page()
>> >> > failed?
>> >>
>> >> If try_to_release_page fails and the such pages stay long time in pagevec,
>> >> pagevec drain often happens.
>> >
>> > You mean because the pagevec becomes full more often? ?These are not
>> > many pages you get extra without the checks, the race window is very
>> > small after all.
>>
>> Right.
>> It was a totally bad answer. The work in midnight makes my mind to be hurt. :)
>>
>> Another point is that we can move such pages(!try_to_release_page,
>> someone else holding the ref) into tail of inactive.
>> We can't expect such pages will be freed sooner or later and it can
>> stir lru pages unnecessary.
>> On the other hand it's a _really_ rare so couldn't we move the pages into tail?
>> If it can be justified, I will remove the check.
>> What do you think about it?
>>
>
> I wonder ...how about adding "victim" list for "Reclaim" pages ? Then, we don't need
> extra LRU rotation.

It could make the code clean.
As far as I can see, a victim list would work as follows (a rough sketch
follows below).

1. Select victim pages by a strong hint.
2. Move the page from the LRU to the victim list.
3. The reclaimer always peeks at the victim list before diving into the LRU lists.
4-1. If a victim page is used by others or is dirty, it can be moved
back into the LRU, or remain on the victim list.
If the page remains a victim, when do we move it back into the LRU if
the reclaimer keeps failing on the page?
We would have to add a new rule.
4-2. If a victim page isn't used by others and is clean, we can
reclaim the page asap.

AFAIK, there are just two strong hints (invalidation, readahead max window heuristic).
I am not sure it's worthwhile to add a new hierarchy (i.e. LRU, victim,
unevictable) just to clean up the minor code paths.
In addition, we would have to add the new rule, so it would make the LRU code
more complicated.
I remember how hard the unevictable feature was to merge.

But I am not against it if we have more use cases. In that case, it would be
worth implementing even though it's not easy.
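
A rough sketch of the flow above -- purely illustrative, every name is
hypothetical (victim_list, victim_add, shrink_victim_list do not exist),
and locking, refcounting and accounting are all omitted:

	/* 1-2. a strong hint (invalidation, readahead) nominates the page */
	static void victim_add(struct zone *zone, struct page *page)
	{
		del_page_from_lru_list(zone, page, page_lru(page));
		list_add_tail(&page->lru, &zone->victim_list);
	}

	/* 3. reclaim peeks at the victim list before scanning the LRUs */
	static unsigned long shrink_victim_list(struct zone *zone)
	{
		unsigned long nr_reclaimed = 0;
		struct page *page, *next;

		list_for_each_entry_safe(page, next, &zone->victim_list, lru) {
			if (page_mapped(page) || PageDirty(page) ||
						PageWriteback(page)) {
				/* 4-1. still busy: back to the LRU (or keep it?) */
				list_move(&page->lru,
					&zone->lru[page_lru_base_type(page)].list);
				continue;
			}
			/* 4-2. clean and unused: reclaim it asap */
			nr_reclaimed++;		/* free the page here */
		}
		return nr_reclaimed;
	}

The open question in 4-1 (when a page should leave the victim list again)
is exactly where the new rule would have to go.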

Thanks, Kame.

>
> Thanks,
> -Kame
>
>
>



--
Kind regards,
Minchan Kim

2010-12-08 02:02:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, 8 Dec 2010 10:43:08 +0900
Minchan Kim <[email protected]> wrote:

> Hi Kame,
>
Hi,

> > I wonder ...how about adding "victim" list for "Reclaim" pages ? Then, we don't need
> > extra LRU rotation.
>
> It can make the code clean.
> As far as I think, victim list does following as.
>
> 1. select victim pages by strong hint
> 2. move the page from LRU to victim
> 3. reclaimer always peeks victim list before diving into LRU list.
> 4-1. If the victim pages is used by others or dirty, it can be moved
> into LRU, again or remain the page in victim list.
> If the page is remained victim, when do we move it into LRU again if
> the reclaimer continues to fail the page?
When someone touches it.

> We have to put the new rule.
> 4-2. If the victim pages isn't used by others and clean, we can
> reclaim the page asap.
>
> AFAIK, strong hints are just two(invalidation, readahead max window heuristic).
> I am not sure it's valuable to add new hierarchy(ie, LRU, victim,
> unevictable) for cleaning the minor codes.
> In addition, we have to put the new rule so it would make the LRU code
> complicated.
> I remember how unevictable feature merge is hard.
>
yes, it was hard.

> But I am not against if we have more usecases. In this case, it's
> valuable to implement it although it's not easy.
>

I wonder "victim list" can be used for something like Cleancache, when
we have very-low-latency backend devices.
And we may able to have page-cache-limit, which Balbir proposed as.

- kvictimed? will move unmappedd page caches to victim list
This may work like a InactiveClean list which we had before and make
sizing easy.

Thanks,
-Kame


2010-12-08 02:15:24

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, Dec 8, 2010 at 10:56 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 8 Dec 2010 10:43:08 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Hi Kame,
>>
> Hi,
>
>> > I wonder ...how about adding "victim" list for "Reclaim" pages ? Then, we don't need
>> > extra LRU rotation.
>>
>> It can make the code clean.
>> As far as I think, victim list does following as.
>>
>> 1. select victim pages by strong hint
>> 2. move the page from LRU to victim
>> 3. reclaimer always peeks victim list before diving into LRU list.
>> 4-1. If the victim pages is used by others or dirty, it can be moved
>> into LRU, again or remain the page in victim list.
>> If the page is remained victim, when do we move it into LRU again if
>> the reclaimer continues to fail the page?
> When sometone touches it.
>
>> We have to put the new rule.
>> 4-2. If the victim pages isn't used by others and clean, we can
>> reclaim the page asap.
>>
>> AFAIK, strong hints are just two(invalidation, readahead max window heuristic).
>> I am not sure it's valuable to add new hierarchy(ie, LRU, victim,
>> unevictable) for cleaning the minor codes.
>> In addition, we have to put the new rule so it would make the LRU code
>> complicated.
>> I remember how unevictable feature merge is hard.
>>
> yes, it was hard.
>
>> But I am not against if we have more usecases. In this case, it's
>> valuable to implement it although it's not easy.
>>
>
> I wonder "victim list" can be used for something like Cleancache, when
> we have very-low-latency backend devices.
> And we may able to have page-cache-limit, which Balbir proposed as.

Yes, I thought that, too. I think it would be a good feature for embedded systems.

>
> - kvictimed? will move unmappedd page caches to victim list
> This may work like a InactiveClean list which we had before and make
> sizing easy.
>

Before we discuss further, we need a customer to confirm the need.
We know very well that it is very hard to merge something nobody uses.

Balbir, what do you think about it?


> Thanks,
> -Kame
>
>
>
>



--
Kind regards,
Minchan Kim

2010-12-08 06:56:58

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

* MinChan Kim <[email protected]> [2010-12-08 11:15:19]:

> On Wed, Dec 8, 2010 at 10:56 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Wed, 8 Dec 2010 10:43:08 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> Hi Kame,
> >>
> > Hi,
> >
> >> > I wonder ...how about adding "victim" list for "Reclaim" pages ? Then, we don't need
> >> > extra LRU rotation.
> >>
> >> It can make the code clean.
> >> As far as I think, victim list does following as.
> >>
> >> 1. select victim pages by strong hint
> >> 2. move the page from LRU to victim
> >> 3. reclaimer always peeks victim list before diving into LRU list.
> >> 4-1. If the victim pages is used by others or dirty, it can be moved
> >> into LRU, again or remain the page in victim list.
> >> If the page is remained victim, when do we move it into LRU again if
> >> the reclaimer continues to fail the page?
> > When sometone touches it.
> >
> >> We have to put the new rule.
> >> 4-2. If the victim pages isn't used by others and clean, we can
> >> reclaim the page asap.
> >>
> >> AFAIK, strong hints are just two(invalidation, readahead max window heuristic).
> >> I am not sure it's valuable to add new hierarchy(ie, LRU, victim,
> >> unevictable) for cleaning the minor codes.
> >> In addition, we have to put the new rule so it would make the LRU code
> >> complicated.
> >> I remember how unevictable feature merge is hard.
> >>
> > yes, it was hard.
> >
> >> But I am not against if we have more usecases. In this case, it's
> >> valuable to implement it although it's not easy.
> >>
> >
> > I wonder "victim list" can be used for something like Cleancache, when
> > we have very-low-latency backend devices.
> > And we may able to have page-cache-limit, which Balbir proposed as.
>
> Yes, I thought that, too. I think it would be a good feature in embedded system.
>
> >
> > - kvictimed? will move unmappedd page caches to victim list
> > This may work like a InactiveClean list which we had before and make
> > sizing easy.
> >
>
> Before further discuss, we need customer's confirm.
> We know very well it is very hard to merge if anyone doesn't use.
>
> Balbir, What do think about it?
>

The idea seems interesting. I am in the process of refreshing my
patches for unmapped page cache control. I presume the process of
filling the victim list will be similar to what I have for unmapped
page cache isolation.

>
> > Thanks,
> > -Kame
> >
> >
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim

--
Three Cheers,
Balbir

2010-12-08 07:26:11

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] Prevent activation of page in madvise_dontneed

On Tue, 7 Dec 2010, Minchan Kim wrote:
>
> How about this? Although it doesn't remove null dependency, it meet my
> goal without big overhead.
> It's just quick patch.

Roughly, yes; by "just quick patch" I take you to mean that I should
not waste time on all the minor carelessnesses scattered through it.

> If you agree, I will resend this version as formal patch.
> (If you suffered from seeing below word-wrapped source, see the
> attachment. I asked to google two time to support text-plain mode in
> gmail web but I can't receive any response until now. ;(. Lots of
> kernel developer in google. Please support this mode for us who can't
> use SMTP although it's a very small VOC)

Tiresome. Seems not to be high on gmail's priorities.
It's sad to see even Linus attaching patches these days.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e097df6..14ae918 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -771,6 +771,7 @@ struct zap_details {
> pgoff_t last_index; /* Highest page->index
> to unmap */
> spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
> unsigned long truncate_count; /* Compare vm_truncate_count */
> + int ignore_reference;
> };
>
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 319528b..fdb0253 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -162,18 +162,22 @@ static long madvise_dontneed(struct vm_area_struct * vma,
> struct vm_area_struct ** prev,
> unsigned long start, unsigned long end)
> {
> + struct zap_details details ;
> +
> *prev = vma;
> if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> return -EINVAL;
>
> if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
> - struct zap_details details = {
> - .nonlinear_vma = vma,
> - .last_index = ULONG_MAX,
> - };
> - zap_page_range(vma, start, end - start, &details);
> - } else
> - zap_page_range(vma, start, end - start, NULL);
> + details.nonlinear_vma = vma;
> + details.last_index = ULONG_MAX;
> + } else {
> + details.nonlinear_vma = NULL;
> + details.last_index = NULL;
> + }
> +
> + details.ignore_references = true;
> + zap_page_range(vma, start, end - start, &details);
> return 0;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ebfeedf..d46ac42 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -897,9 +897,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t *pte;
> spinlock_t *ptl;
> int rss[NR_MM_COUNTERS];
> -
> + bool ignore_reference = false;
> init_rss_vec(rss);
>
> + if (details && ((!details->check_mapping && !details->nonlinear_vma)
> + || !details->ignore_reference))
> + details = NULL;
> +

bool mark_accessed = true;

if (VM_SequentialReadHint(vma) ||
(details && details->ignore_reference))
mark_accessed = false;
if (details && !details->check_mapping && !details->nonlinear_vma)
details = NULL;


> pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> arch_enter_lazy_mmu_mode();
> do {
> @@ -949,7 +955,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (pte_dirty(ptent))
> set_page_dirty(page);
> if (pte_young(ptent) &&
> - likely(!VM_SequentialReadHint(vma)))
> + likely(!VM_SequentialReadHint(vma)) &&
> + likely(!ignore_reference))
> mark_page_accessed(page);

if (pte_young(ptent) && mark_accessed)
mark_page_accessed(page);


> rss[MM_FILEPAGES]--;
> }
> @@ -1038,8 +1045,6 @@ static unsigned long unmap_page_range(struct
> mmu_gather *tlb,
> pgd_t *pgd;
> unsigned long next;
>
> - if (details && !details->check_mapping && !details->nonlinear_vma)
> - details = NULL;
>
> BUG_ON(addr >= end);
> mem_cgroup_uncharge_start();
> @@ -1102,7 +1107,8 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
> unsigned long tlb_start = 0; /* For tlb_finish_mmu */
> int tlb_start_valid = 0;
> unsigned long start = start_addr;
> - spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> + spinlock_t *i_mmap_lock = details ?
> + (detais->check_mapping ? details->i_mmap_lock: NULL) : NULL;

Why that change?

Hugh

2010-12-08 07:55:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] Prevent activation of page in madvise_dontneed

On Wed, Dec 8, 2010 at 4:26 PM, Hugh Dickins <[email protected]> wrote:
> On Tue, 7 Dec 2010, Minchan Kim wrote:
>>
>> How about this? Although it doesn't remove null dependency, it meet my
>> goal without big overhead.
>> It's just quick patch.
>
> Roughly, yes; by "just quick patch" I take you to mean that I should
> not waste time on all the minor carelessnesses scattered through it.
>
>> If you agree, I will resend this version as formal patch.
>> (If you suffered from seeing below word-wrapped source, see the
>> attachment. I asked to google two time to support text-plain mode in
>> gmail web but I can't receive any response until now. ;(. Lots of
>> kernel developer in google. Please support this mode for us who can't
>> use SMTP although it's a very small VOC)
>
> Tiresome.  Seems not to be high on gmail's priorities.
> It's sad to see even Linus attaching patches these days.

That encourages me (but I don't mean I will use attachments again. :)).

>
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e097df6..14ae918 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -771,6 +771,7 @@ struct zap_details {
>>         pgoff_t last_index;                     /* Highest page->index
>> to unmap */
>>         spinlock_t *i_mmap_lock;                /* For unmap_mapping_range: */
>>         unsigned long truncate_count;           /* Compare vm_truncate_count */
>> +       int ignore_reference;
>>  };
>>
>> ?struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 319528b..fdb0253 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -162,18 +162,22 @@ static long madvise_dontneed(struct vm_area_struct * vma,
>>                              struct vm_area_struct ** prev,
>>                              unsigned long start, unsigned long end)
>>  {
>> +       struct zap_details details ;
>> +
>>         *prev = vma;
>>         if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>>                 return -EINVAL;
>>
>>         if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
>> -               struct zap_details details = {
>> -                       .nonlinear_vma = vma,
>> -                       .last_index = ULONG_MAX,
>> -               };
>> -               zap_page_range(vma, start, end - start, &details);
>> -       } else
>> -               zap_page_range(vma, start, end - start, NULL);
>> +               details.nonlinear_vma = vma;
>> +               details.last_index = ULONG_MAX;
>> +       } else {
>> +               details.nonlinear_vma = NULL;
>> +               details.last_index = NULL;
>> +       }
>> +
>> +       details.ignore_references = true;
>> +       zap_page_range(vma, start, end - start, &details);
>>         return 0;
>>  }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ebfeedf..d46ac42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -897,9 +897,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>         pte_t *pte;
>>         spinlock_t *ptl;
>>         int rss[NR_MM_COUNTERS];
>> -
>> +       bool ignore_reference = false;
>>         init_rss_vec(rss);
>>
>> +       if (details && ((!details->check_mapping && !details->nonlinear_vma)
>> +                                       || !details->ignore_reference))
>> +               details = NULL;
>> +
>
>        bool mark_accessed = true;
>
>        if (VM_SequentialReadHint(vma) ||
>            (details && details->ignore_reference))
>                mark_accessed = false;
>        if (details && !details->check_mapping && !details->nonlinear_vma)
>                details = NULL;
>
>
>>         pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>         arch_enter_lazy_mmu_mode();
>>         do {
>> @@ -949,7 +955,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                                 if (pte_dirty(ptent))
>>                                         set_page_dirty(page);
>>                                 if (pte_young(ptent) &&
>> -                                   likely(!VM_SequentialReadHint(vma)))
>> +                                   likely(!VM_SequentialReadHint(vma)) &&
>> +                                   likely(!ignore_reference))
>>                                         mark_page_accessed(page);
>
>                                if (pte_young(ptent) && mark_accessed)
>                                        mark_page_accessed(page);
>
>

Much cleaner.

>>                                 rss[MM_FILEPAGES]--;
>>                         }
>> @@ -1038,8 +1045,6 @@ static unsigned long unmap_page_range(struct
>> mmu_gather *tlb,
>>         pgd_t *pgd;
>>         unsigned long next;
>>
>> -       if (details && !details->check_mapping && !details->nonlinear_vma)
>> -               details = NULL;
>>
>>         BUG_ON(addr >= end);
>>         mem_cgroup_uncharge_start();
>> @@ -1102,7 +1107,8 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
>>         unsigned long tlb_start = 0;    /* For tlb_finish_mmu */
>>         int tlb_start_valid = 0;
>>         unsigned long start = start_addr;
>> -       spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
>> +       spinlock_t *i_mmap_lock = details ?
>> +               (detais->check_mapping ? details->i_mmap_lock: NULL) : NULL;
>
> Why that change?

That was done very carelessly. Sorry for that. I thought i_mmap_lock
was always used with check_mapping. Clearly wrong!
My concern is that without such a check, a caller that only uses
ignore_reference would have to initialize i_mmap_lock to NULL itself.
That's bad.

Hmm...

>
> Hugh
>



--
Kind regards,
Minchan Kim
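
For what it's worth, one possible way to sidestep both problems is sketched
below. This is not the posted patch; it assumes the struct field is renamed
to ignore_references to match the madvise.c side, leaves the i_mmap_lock line
in unmap_vmas() untouched, and relies on the designated initializer to zero
the fields madvise_dontneed() does not use.

	/* Hypothetical follow-up sketch, not the posted series.  The
	 * designated initializer zeroes check_mapping, i_mmap_lock and
	 * truncate_count, so unmap_vmas() can keep its original
	 *	spinlock_t *i_mmap_lock = details ? details->i_mmap_lock : NULL;
	 * line unchanged.
	 */
	static long madvise_dontneed(struct vm_area_struct *vma,
				     struct vm_area_struct **prev,
				     unsigned long start, unsigned long end)
	{
		struct zap_details details = {
			.last_index = ULONG_MAX,
			.ignore_references = true,	/* skip mark_page_accessed() */
		};

		*prev = vma;
		if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
			return -EINVAL;

		if (unlikely(vma->vm_flags & VM_NONLINEAR))
			details.nonlinear_vma = vma;

		zap_page_range(vma, start, end - start, &details);
		return 0;
	}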

2010-12-08 08:02:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] add profile information for invalidated page reclaim

> This patch adds profile information about invalidated page reclaim.
> It's just for profiling for test so it would be discard when the series
> are merged.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> include/linux/vmstat.h | 4 ++--
> mm/swap.c | 3 +++
> mm/vmstat.c | 3 +++
> 3 files changed, 8 insertions(+), 2 deletions(-)

Today we have tracepoints. A tracepoint has no overhead when it's unused,
but a vmstat counter has overhead even when unused.

Then, every new vmstat proposal should describe why you think administrators
will use it frequently.




>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 833e676..c38ad95 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -30,8 +30,8 @@
>
> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGALLOC),
> - PGFREE, PGACTIVATE, PGDEACTIVATE,
> - PGFAULT, PGMAJFAULT,
> + PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
> + PGRECLAIM, PGFAULT, PGMAJFAULT,
> FOR_ALL_ZONES(PGREFILL),
> FOR_ALL_ZONES(PGSTEAL),
> FOR_ALL_ZONES(PGSCAN_KSWAPD),
> diff --git a/mm/swap.c b/mm/swap.c
> index 0f23998..2f21e6e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -317,6 +317,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
> * is _really_ small and it's non-critical problem.
> */
> SetPageReclaim(page);
> + __count_vm_event(PGRECLAIM);

Um. No.
This is not a reclaim operation anyway. Userland folks shouldn't know
you override PG_reclaim; it's internal implementation information.



> } else {
> /*
> * The page's writeback ends up during pagevec
> @@ -328,6 +329,8 @@ static void lru_deactivate(struct page *page, struct zone *zone)
>
> if (active)
> __count_vm_event(PGDEACTIVATE);
> +
> + __count_vm_event(PGINVALIDATE);
> update_page_reclaim_stat(zone, page, file, 0);

I have a similar complaint as above.
If you add PGINVALIDATE, other invalidate paths should update this counter too.


> }
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3555636..ef6102d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -818,6 +818,9 @@ static const char * const vmstat_text[] = {
> "pgactivate",
> "pgdeactivate",
>
> + "pginvalidate",
> + "pgreclaim",
> +
> "pgfault",
> "pgmajfault",
>
> --
> 1.7.0.4
>
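
As an aside, a tracepoint-based alternative to the two counters above might
look roughly like the sketch below. The names are hypothetical, it is not part
of the posted series, and the usual include/trace/events/ header boilerplate
is omitted; the point, as KOSAKI says, is that a disabled tracepoint costs
nothing, unlike a permanent vmstat counter.

	/* Hypothetical sketch only, not part of the posted series. */
	#include <linux/tracepoint.h>

	TRACE_EVENT(mm_lru_deactivate,

		TP_PROTO(unsigned long pfn, int was_active),

		TP_ARGS(pfn, was_active),

		TP_STRUCT__entry(
			__field(unsigned long,	pfn)
			__field(int,		was_active)
		),

		TP_fast_assign(
			__entry->pfn = pfn;
			__entry->was_active = was_active;
		),

		TP_printk("pfn=%lu was_active=%d", __entry->pfn, __entry->was_active)
	);

	/* Caller side, e.g. in lru_deactivate():
	 *	trace_mm_lru_deactivate(page_to_pfn(page), active);
	 */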


2010-12-08 08:05:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

> invalidate_mapping_pages is very big hint to reclaimer.
> It means user doesn't want to use the page any more.
> So in order to prevent working set page eviction, this patch
> move the page into tail of inactive list by PG_reclaim.
>
> Please, remember that pages in inactive list are working set
> as well as active list. If we don't move pages into inactive list's
> tail, pages near by tail of inactive list can be evicted although
> we have a big clue about useless pages. It's totally bad.
>
> Now PG_readahead/PG_reclaim is shared.
> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> preventing fast reclaiming readahead marker page.
>
> In this series, PG_reclaim is used by invalidated page, too.
> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> to reclaim asap. Then, when the dirty page will be writeback,
> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> It disturbs this serie's goal.
>
> I think it's okay to clear PG_readahead when the page is dirty, not
> writeback time. So this patch moves ClearPageReadahead.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>

Until someone shows data, I will not ack this. This patch increases
VM state, but the benefit is dubious.



2010-12-08 08:08:43

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] move memcg reclaimable page into tail of inactive list

> Golbal page reclaim moves reclaimalbe pages into inactive list
> to reclaim asap. This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Good catch!
Reviewed-by: KOSAKI Motohiro <[email protected]>


2010-12-08 08:13:37

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] add profile information for invalidated page reclaim

Hi KOSAKI,

On Wed, Dec 8, 2010 at 5:02 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> This patch adds profile information about invalidated page reclaim.
>> It's just for profiling for test so it would be discard when the series
>> are merged.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> ---
>>  include/linux/vmstat.h |    4 ++--
>>  mm/swap.c              |    3 +++
>>  mm/vmstat.c            |    3 +++
>>  3 files changed, 8 insertions(+), 2 deletions(-)
>
> Today we have tracepoints. A tracepoint has no overhead when it's unused,
> but a vmstat counter has overhead even when unused.
>
> Then, every new vmstat proposal should describe why you think administrators
> will use it frequently.

It's just to make gathering the data easy when Ben tests.
I never intended to merge it upstream, or even into mmotm.

If you don't like it even just for testing, I am happy to replace it with a tracepoint.

Thanks.
--
Kind regards,
Minchan Kim

2010-12-08 08:24:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Wed, Dec 8, 2010 at 5:04 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> invalidate_mapping_pages is very big hint to reclaimer.
>> It means user doesn't want to use the page any more.
>> So in order to prevent working set page eviction, this patch
>> move the page into tail of inactive list by PG_reclaim.
>>
>> Please, remember that pages in inactive list are working set
>> as well as active list. If we don't move pages into inactive list's
>> tail, pages near by tail of inactive list can be evicted although
>> we have a big clue about useless pages. It's totally bad.
>>
>> Now PG_readahead/PG_reclaim is shared.
>> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
>> preventing fast reclaiming readahead marker page.
>>
>> In this series, PG_reclaim is used by invalidated page, too.
>> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
>> to reclaim asap. Then, when the dirty page will be writeback,
>> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
>> It disturbs this serie's goal.
>>
>> I think it's okay to clear PG_readahead when the page is dirty, not
>> writeback time. So this patch moves ClearPageReadahead.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>
> Until someone shows data, I will not ack this. This patch increases
> VM state, but the benefit is dubious.

Makes sense to me. If Ben is busy, I will measure it and send the result.
Thanks!

--
Kind regards,
Minchan Kim

2010-12-08 08:36:52

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] add profile information for invalidated page reclaim

> Hi KOSAKI,
>
> On Wed, Dec 8, 2010 at 5:02 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> This patch adds profile information about invalidated page reclaim.
> >> It's just for profiling for test so it would be discard when the series
> >> are merged.
> >>
> >> Signed-off-by: Minchan Kim <[email protected]>
> >> Cc: Rik van Riel <[email protected]>
> >> Cc: KOSAKI Motohiro <[email protected]>
> >> Cc: Wu Fengguang <[email protected]>
> >> Cc: Johannes Weiner <[email protected]>
> >> Cc: Nick Piggin <[email protected]>
> >> Cc: Mel Gorman <[email protected]>
> >> ---
> >>  include/linux/vmstat.h |    4 ++--
> >>  mm/swap.c              |    3 +++
> >>  mm/vmstat.c            |    3 +++
> >>  3 files changed, 8 insertions(+), 2 deletions(-)
> >
> > Today we have tracepoints. A tracepoint has no overhead when it's unused,
> > but a vmstat counter has overhead even when unused.
> >
> > Then, every new vmstat proposal should describe why you think administrators
> > will use it frequently.
>
> It's just to make gathering the data easy when Ben tests.
> I never intended to merge it upstream, or even into mmotm.

Ok, I had not understood your intention. Thank you.



> If you don't like it even just for testing, I am happy to replace it with a tracepoint.




2010-12-08 13:01:29

by Ben Gamari

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

> Make sense to me. If Ben is busy, I will measure it and send the result.

I've done measurements on the patched kernel. All that remains is to do
measurements on the baseline unpatched case. To summarize the results
thus far,

Times:
=======
                      user    sys     %cpu    inputs          outputs
Patched, drop          142     64      46     13557744        14052744
Patched, nodrop         55     57      33     13557936        13556680

vmstat:
========
                       free_pages  inact_anon  act_anon  inact_file  act_file  dirtied  written  reclaim
Patched, drop, pre         306043       37541    185463      276266    153955  3689674  3604959  1550641
Patched, drop, post         13233       38462    175252      536346    178792  5527564  5371563  3169155

Patched, nodrop, pre       475211       38602    175242       81979    178820  5527592  5371554  3169155
Patched, nodrop, post        7697       38959    176986      547984    180855  7324836  7132158  3169155

Altogether, it seems that something is horribly wrong, most likely with
my test (or rsync patch). I'll do the baseline benchmarks today.

Thoughts?

Thanks,

- Ben

2010-12-08 23:10:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Wed, Dec 8, 2010 at 10:01 PM, Ben Gamari <[email protected]> wrote:
>> Make sense to me. If Ben is busy, I will measure it and send the result.
>
> I've done measurements on the patched kernel. All that remains is to do
> measurements on the baseline unpached case. To summarize the results
> thusfar,
>
> Times:
> =======
>                       user    sys     %cpu    inputs          outputs
> Patched, drop          142     64      46     13557744        14052744
> Patched, nodrop         55     57      33     13557936        13556680
>
> vmstat:
> ========
>                        free_pages  inact_anon  act_anon  inact_file  act_file  dirtied  written  reclaim
> Patched, drop, pre         306043       37541    185463      276266    153955  3689674  3604959  1550641
> Patched, drop, post         13233       38462    175252      536346    178792  5527564  5371563  3169155
>
> Patched, nodrop, pre       475211       38602    175242       81979    178820  5527592  5371554  3169155
> Patched, nodrop, post        7697       38959    176986      547984    180855  7324836  7132158  3169155
>
> Altogether, it seems that something is horribly wrong, most likely with
> my test (or rsync patch). I'll do the baseline benchmarks today.
>
> Thoughts?


How did you test it?
I think the patch's effect would show up best under heavy memory pressure.

I quickly ran it on my desktop environment (2G DRAM),
so it's not a complete result. I will test more when I'm out of the office.

Used kernel : mmotm-12-02 + my patch series
Used rsync :
1. rsync_normal : v3.0.7 vanilla
2. rsync_patch : v3.0.7 + Ben's patch(fadvise)

Test scenario :
* kernel full compile
* git clone linux-kernel
* rsync local host directory to local host dst directory


1) rsync_normal : 89.08user 127.48system 33:22.24elapsed
2) rsync_patch : 88.42user 135.26system 31:30.56elapsed

1) rsync_normal vmstat :
pgfault : 45538203
pgmajfault : 4181

pgactivate 377416
pgdeactivate 34183
pginvalidate 0
pgreclaim 0

pgsteal_dma 0
pgsteal_normal 2144469
pgsteal_high 2884412
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_normal 2149739
pgscan_kswapd_high 2909140
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_normal 647
pgscan_direct_high 716
pgscan_direct_movable 0
pginodesteal 0
slabs_scanned 1737344
kswapd_steal 5028353
kswapd_inodesteal 438910
pageoutrun 81208
allocstall 9
pgrotated 1642

2) rsync_patch vmstat:

pgfault : 47570231
pgmajfault : 2669

pgactivate 391806
pgdeactivate 36861
pginvalidate 1685065
pgreclaim 1685065

pgrefill_dma 0
pgrefill_normal 32025
pgrefill_high 9619
pgrefill_movable 0
pgsteal_dma 0
pgsteal_normal 744904
pgsteal_high 1079709
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_normal 745017
pgscan_kswapd_high 1096660
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_normal 0
pgscan_direct_high 0
pgscan_direct_movable 0
pginodesteal 0
slabs_scanned 1896960
kswapd_steal 1824613
kswapd_inodesteal 703499
pageoutrun 26828
allocstall 0
pgrotated 1681570

In summary,
unfortunately the number of faults increased (45538203 -> 47570231),
but pgmajfault is reduced (4181 -> 2669).

The amount of scanning is much reduced: 2149739 -> 745017, 2909140 ->
1096660, and there is no direct reclaim at all in the patched rsync run.

The number of stolen pages is much reduced: 2144469 -> 744904, 2884412 ->
1079709, 5028353 -> 1824613.

The elapsed time is reduced by about 2 minutes.

I think the result is good. The reduced steal count could imply that we
avoid evicting working set pages.

It gets a good result with little effort (less scanning).
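In other words, kswapd scanning drops from about 2149739 + 2909140 = 5058879
pages to 745017 + 1096660 = 1841677, and kswapd_steal drops from 5028353 to
1824613; both fall by roughly a factor of 2.7 while direct reclaim disappears
entirely.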

I will resend with more exact measurements after repeated tests.

> Thanks,
>
> - Ben
>
>



--
Kind regards,
Minchan Kim
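
For readers who don't have the rsync patch handy: the "Ben's patch (fadvise)"
used in these runs boils down to calling posix_fadvise(POSIX_FADV_DONTNEED) on
file data rsync has finished with. A minimal stand-alone illustration of that
pattern (not the actual rsync patch) might look like this:

	/* Minimal illustration of the fadvise pattern under test; this is
	 * not Ben's rsync patch, just the same idea in stand-alone form.
	 */
	#define _POSIX_C_SOURCE 200112L
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char buf[1 << 16];
		int fd, err;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <file>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* pretend to copy the data somewhere */

		/* We are done with this file: tell the kernel its page cache
		 * can go, so reclaim doesn't have to evict someone else's
		 * working set instead.
		 */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

		close(fd);
		return 0;
	}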

2010-12-09 00:19:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] deactivate invalidated pages

On Wed, Dec 8, 2010 at 3:56 PM, Balbir Singh <[email protected]> wrote:
> * MinChan Kim <[email protected]> [2010-12-08 11:15:19]:
>
>> On Wed, Dec 8, 2010 at 10:56 AM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Wed, 8 Dec 2010 10:43:08 +0900
>> > Minchan Kim <[email protected]> wrote:
>> >
>> >> Hi Kame,
>> >>
>> > Hi,
>> >
>> >> > I wonder ...how about adding "victim" list for "Reclaim" pages ? Then, we don't need
>> >> > extra LRU rotation.
>> >>
>> >> It can make the code clean.
>> >> As far as I think, victim list does following as.
>> >>
>> >> 1. select victim pages by strong hint
>> >> 2. move the page from LRU to victim
>> >> 3. reclaimer always peeks victim list before diving into LRU list.
>> >> 4-1. If the victim pages is used by others or dirty, it can be moved
>> >> into LRU, again or remain the page in victim list.
>> >> If the page is remained victim, when do we move it into LRU again if
>> >> the reclaimer continues to fail the page?
>> > When sometone touches it.
>> >
>> >> We have to put the new rule.
>> >> 4-2. If the victim pages isn't used by others and clean, we can
>> >> reclaim the page asap.
>> >>
>> >> AFAIK, strong hints are just two(invalidation, readahead max window heuristic).
>> >> I am not sure it's valuable to add new hierarchy(ie, LRU, victim,
>> >> unevictable) for cleaning the minor codes.
>> >> In addition, we have to put the new rule so it would make the LRU code
>> >> complicated.
>> >> I remember how unevictable feature merge is hard.
>> >>
>> > yes, it was hard.
>> >
>> >> But I am not against if we have more usecases. In this case, it's
>> >> valuable to implement it although it's not easy.
>> >>
>> >
>> > I wonder "victim list" can be used for something like Cleancache, when
>> > we have very-low-latency backend devices.
>> > And we may able to have page-cache-limit, which Balbir proposed as.
>>
>> Yes, I thought that, too. I think it would be a good feature in embedded system.
>>
>> >
>> >  - kvictimed? will move unmappedd page caches to victim list
>> > This may work like a InactiveClean list which we had before and make
>> > sizing easy.
>> >
>>
>> Before we discuss this further, we need confirmation from a real user.
>> We know very well that it is very hard to merge something nobody uses.
>>
>> Balbir, what do you think about it?
>>
>
> The idea seems interesting. I am in the process of refreshing my
> patches for unmapped page cache control. I presume the process of
> filling the victim list will be similar to what I have for unmapped
> page cache isolation.

I saw your previous implementation. It doesn't get any benefit from a
victim list:
it would need to scan pfns, select unmapped pages and move them onto the victim list.
I think we might need kvictimd, as Kame said, but I am not convinced.
If I have trouble implementing my series, I might consider it. But for
now I think my approach is not bad, and the rough test results aren't bad either.

To be honest, I think the victim list (or clean list) should be a separate
project. If it were completed, maybe we could make my patches simpler.
I agree with the page cache limit control point of view, but it should be a separate project.
So I want to merge this series first, and if we really turn out to need a
victim list, let's consider it at that time.

Anyway, I will look at your next version to see whether a victim list is needed.

>
>>
>> > Thanks,
>> > -Kame
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>
> --
> ? ? ? ?Three Cheers,
> ? ? ? ?Balbir
>



--
Kind regards,
Minchan Kim

2010-12-13 15:31:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Thu, Dec 09, 2010 at 08:10:17AM +0900, Minchan Kim wrote:
> On Wed, Dec 8, 2010 at 10:01 PM, Ben Gamari <[email protected]> wrote:
> >> Make sense to me. If Ben is busy, I will measure it and send the result.
> >
> > I've done measurements on the patched kernel. All that remains is to do
> > measurements on the baseline unpached case. To summarize the results
> > thusfar,
> >
> > Times:
> > =======
> >                       user    sys     %cpu    inputs          outputs
> > Patched, drop          142     64      46     13557744        14052744
> > Patched, nodrop         55     57      33     13557936        13556680
> >
> > vmstat:
> > ========
> >                        free_pages  inact_anon  act_anon  inact_file  act_file  dirtied  written  reclaim
> > Patched, drop, pre         306043       37541    185463      276266    153955  3689674  3604959  1550641
> > Patched, drop, post         13233       38462    175252      536346    178792  5527564  5371563  3169155
> >
> > Patched, nodrop, pre       475211       38602    175242       81979    178820  5527592  5371554  3169155
> > Patched, nodrop, post        7697       38959    176986      547984    180855  7324836  7132158  3169155
> >
> > Altogether, it seems that something is horribly wrong, most likely with
> > my test (or rsync patch). I'll do the baseline benchmarks today.
> >
> > Thoughts?
>
>
> How do you test it?
> I think patch's effect would be good in big memory pressure environment.
>
> Quickly I did it on my desktop environment.(2G DRAM)
> So it's not completed result. I will test more when out of office.
>
> Used kernel : mmotm-12-02 + my patch series
> Used rsync :
> 1. rsync_normal : v3.0.7 vanilla
> 2. rsync_patch : v3.0.7 + Ben's patch(fadvise)
>
> Test scenario :
> * kernel full compile
> * git clone linux-kernel
> * rsync local host directory to local host dst directory
>
>
> 1) rsync_normal : 89.08user 127.48system 33:22.24elapsed
> 2) rsync_patch : 88.42user 135.26system 31:30.56elapsed
>
> 1) rsync_normal vmstat :
> pgfault : 45538203
> pgmajfault : 4181
>
> pgactivate 377416
> pgdeactivate 34183
> pginvalidate 0
> pgreclaim 0
>
> pgsteal_dma 0
> pgsteal_normal 2144469
> pgsteal_high 2884412
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_normal 2149739
> pgscan_kswapd_high 2909140
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_normal 647
> pgscan_direct_high 716
> pgscan_direct_movable 0
> pginodesteal 0
> slabs_scanned 1737344
> kswapd_steal 5028353
> kswapd_inodesteal 438910
> pageoutrun 81208
> allocstall 9
> pgrotated 1642
>
> 2) rsync_patch vmstat:
>
> pgfault : 47570231
> pgmajfault : 2669
>
> pgactivate 391806
> pgdeactivate 36861
> pginvalidate 1685065
> pgreclaim 1685065
>
> pgrefill_dma 0
> pgrefill_normal 32025
> pgrefill_high 9619
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_normal 744904
> pgsteal_high 1079709
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_normal 745017
> pgscan_kswapd_high 1096660
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_normal 0
> pgscan_direct_high 0
> pgscan_direct_movable 0
> pginodesteal 0
> slabs_scanned 1896960
> kswapd_steal 1824613
> kswapd_inodesteal 703499
> pageoutrun 26828
> allocstall 0
> pgrotated 1681570
>
> In summary,
> Unfortunately, the number of fault is increased (47570231 - 45538203)
> but pgmajfault is reduced (4181 - 2669).
>
> The number of scanning is much reduced. 2149739 -> 745017, 2909140 ->
> 1096660 and even no direct reclaim in patched rsync.
>
> The number of steal is much reduced. 2144469 -> 744904, 2884412 ->
> 1079709, 5028353 -> 1824613.
>
> The elapsed time is reduced 2 minutes.
>
> I think result is good. Reduced the steal number could imply prevent
> eviction of working set pages.
>
> It has a good result with small effort(small scanning).
>
> I will resend with more exact measurement after repeated test.
>
> > Thanks,
> >
> > - Ben
> >
> >

Test Environment :
DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
Rsync backup directory size : 16G

The rsync version is 3.0.7.
The rsync patch is Ben's fadvise change.
The stress scenario runs the following jobs in parallel.

1. make all -j4 linux, git clone linux-kernel
2. git clone linux-kernel
3. rsync src dst

nrns : no-patched rsync + no stress
prns : patched rsync + no stress
nrs : no-patched rsync + stress
prs : patched rsync + stress

pginvalidate : the number of dirty/writeback pages invalidated by fadvise
pgreclaim : pages moved to the inactive list's tail by the PG_reclaim trick

In summary, my patch improves elapsed time a little bit in the memory
pressure environment and roughly doubles reclaim effectiveness. It means
reclaim latency is shorter and working set pages are not evicted because
of invalidated pages.

Look at the reclaim effectiveness: the patched rsync improves it by about
2x, and compared to mmotm-12-03, mmotm-12-03-fadvise improves elapsed time
by about 3 minutes in the stress environment.
I think that's due to the reduced scanning and reclaim overhead.

In the no-stress environment, fadvise makes the program a little bit slower.
I think it's because there are many more page faults, but I don't know why
that happens. Can you guess why?

Before further work, I would like to hear opinions.
Any comment is welcome.

Thanks.

== CUT_HERE ==

mmotm-12-03-fadvise

                          nrns        prns        nrs         prs

elapsed                 27:29.49    28:29.31    41:19.64    40:31.80
pginvalidate                   0     1941654           0     1773948
pgreclaim                      0     1941654           0     1773947
pgfault                   254691      462927    61865725    61015497
pgmajfault                   206         234        3552        2240
pgsteal_normal           3047828     1581276     3142791     1845419
pgsteal_high             4757751     2419625     5351655     2731594
pgscan_kswapd_dma              0           0           0           0
pgscan_kswapd_normal     3047960     1581780     3146659     1846071
pgscan_kswapd_high       4758492     2428427     5359960     2732502
pgscan_kswapd_movable          0           0           0           0
pgscan_direct_normal           0           0           0           0
pgscan_direct_high             0           0           0           0
slabs_scanned            1408512     1672704     1839360     2049792
kswapd_steal             7805579     4000901     8494446     4577013
kswapd_inodesteal         146398      502321      280462      493576
pageoutrun                 96384       49907      139530       62029
allocstall                     0           0           0           0
pgrotated                      0     1935875        2511     1768188

mmotm-12-03
                          nrns        prns        nrs         prs

elapsed                 27:38.84    29:50.22    41:12.10    43:46.63
pgfault                   256793      415539    61205117    66495840
pgmajfault                   216         276        3577        3383
pgsteal_normal           2963119     1743418     3247443     1904633
pgsteal_high             4832522     2718769     5202506     3124215
pgscan_kswapd_normal     2963164     1743717     3252646     1908354
pgscan_kswapd_high       4832815     2727233     5213874     3132365
pgscan_direct_normal           0           0          70           0
pgscan_direct_high             0           0          98           0
slabs_scanned            1409408     1360512     1837440     1897856
kswapd_steal             7795641     4462187     8449781     5028848
kswapd_inodesteal         142039         178      311758      255974
pageoutrun                 97538       55949      130136       84663
allocstall                     0           0           2           0
pgrotated                      0           0           0           0

>
> --
> Kind regards,
> Minchan Kim

--
Kind regards,
Minchan Kim

2010-12-13 20:06:48

by Ben Gamari

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Tue, 14 Dec 2010 00:31:05 +0900, Minchan Kim <[email protected]> wrote:
> In summary, my patch enhances a littie bit about elapsed time in
> memory pressure environment and enhance reclaim effectivness(reclaim/reclaim)
> with x2. It means reclaim latency is short and doesn't evict working set
> pages due to invalidated pages.
>
Thank you very much for this testing! I'm very sorry I've been unable to
contribute more recently. My last exam is on Wednesday and besides some
grading that is the end of the semester. Is there anything you would
like me to do? Perhaps reproducing these results on my setup would be
useful?

> Look at reclaim effectivness. Patched rsync enhances x2 about reclaim
> effectiveness and compared to mmotm-12-03, mmotm-12-03-fadvise enhances
> 3 minute about elapsed time in stress environment.
> I think it's due to reduce scanning, reclaim overhead.
>
Good good. This looks quite promising.

> In no-stress enviroment, fadivse makes program little bit slow.
> I think because there are many pgfault. I don't know why it happens.
> Could you guess why it happens?
>
Hmm, nothing comes to mind. As I've said in the past, rsync should
require each page only once. Perhaps perf might offer some insight into
where this time is being spent?

- Ben

2010-12-14 02:13:05

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Tue, 14 Dec 2010 00:31:05 +0900
Minchan Kim <[email protected]> wrote:

> Test Environment :
> DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
> Rsync backup directory size : 16G
>
> rsync version is 3.0.7.
> rsync patch is Ben's fadivse.
> stress scenario do following jobs with parallel.
>
> 1. make all -j4 linux, git clone linux-kernel
> 2. git clone linux-kernel
> 3. rsync src dst
>
> nrns : no-patched rsync + no stress
> prns : patched rsync + no stress
> nrs : no-patched rsync + stress
> prs : patched rsync + stress
>
> pginvalidate : the number of dirty/writeback pages which is invalidated by fadvise
> pgreclaim : pages moved PG_reclaim trick in inactive's tail
>
> In summary, my patch enhances a littie bit about elapsed time in
> memory pressure environment and enhance reclaim effectivness(reclaim/reclaim)
> with x2. It means reclaim latency is short and doesn't evict working set
> pages due to invalidated pages.
>
> Look at reclaim effectivness. Patched rsync enhances x2 about reclaim
> effectiveness and compared to mmotm-12-03, mmotm-12-03-fadvise enhances
> 3 minute about elapsed time in stress environment.
> I think it's due to reduce scanning, reclaim overhead.
>
> In no-stress enviroment, fadivse makes program little bit slow.
> I think because there are many pgfault. I don't know why it happens.
> Could you guess why it happens?
>
> Before futher work, I hope listen opinions.
> Any comment is welcome.
>

First of all, the improvement seems great. Thank you for your effort.

> In no-stress enviroment, fadivse makes program little bit slow.
> I think because there are many pgfault. I don't know why it happens.
> Could you guess why it happens?
>

Is there no program which accesses the rsync'ed directory?

Thanks,
-Kame

2010-12-14 02:34:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

On Tue, Dec 14, 2010 at 11:07 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 14 Dec 2010 00:31:05 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Test Environment :
>> DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
>> Rsync backup directory size : 16G
>>
>> rsync version is 3.0.7.
>> rsync patch is Ben's fadivse.
>> stress scenario do following jobs with parallel.
>>
>> 1. make all -j4 linux, git clone linux-kernel
>> 2. git clone linux-kernel
>> 3. rsync src dst
>>
>> nrns : no-patched rsync + no stress
>> prns : patched rsync + no stress
>> nrs  : no-patched rsync + stress
>> prs  : patched rsync + stress
>>
>> pginvalidate : the number of dirty/writeback pages which is invalidated by fadvise
>> pgreclaim : pages moved PG_reclaim trick in inactive's tail
>>
>> In summary, my patch enhances a littie bit about elapsed time in
>> memory pressure environment and enhance reclaim effectivness(reclaim/reclaim)
>> with x2. It means reclaim latency is short and doesn't evict working set
>> pages due to invalidated pages.
>>
>> Look at reclaim effectivness. Patched rsync enhances x2 about reclaim
>> effectiveness and compared to mmotm-12-03, mmotm-12-03-fadvise enhances
>> 3 minute about elapsed time in stress environment.
>> I think it's due to reduce scanning, reclaim overhead.
>>
>> In no-stress enviroment, fadivse makes program little bit slow.
>> I think because there are many pgfault. I don't know why it happens.
>> Could you guess why it happens?
>>
>> Before futher work, I hope listen opinions.
>> Any comment is welcome.
>>
>
> First of all, the improvement seems great. Thank you for your effort.

Thanks, Kame.

>
>> In no-stress enviroment, fadivse makes program little bit slow.
>> I think because there are many pgfault. I don't know why it happens.
>> Could you guess why it happens?
>>
>
> Is there no program which accesses the rsync'ed directory?

Maybe; some programs might have the files mmaped.
But even if that happens, deactivate_page notices it and does nothing,
so the page can't be reclaimed.

>
> Thanks,
> -Kame
>
>



--
Kind regards,
Minchan Kim

2010-12-14 02:36:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] Reclaim invalidated page ASAP

Hi Ben,

On Tue, Dec 14, 2010 at 5:06 AM, Ben Gamari <[email protected]> wrote:
> On Tue, 14 Dec 2010 00:31:05 +0900, Minchan Kim <[email protected]> wrote:
>> In summary, my patch enhances a littie bit about elapsed time in
>> memory pressure environment and enhance reclaim effectivness(reclaim/reclaim)
>> with x2. It means reclaim latency is short and doesn't evict working set
>> pages due to invalidated pages.
>>
> Thank you very much for this testing! I'm very sorry I've been unable to
> contribute more recently. My last exam is on Wednesday and besides some
> grading that is the end of the semester.  Is there anything you would

No worries. I hope you get a great grade on your exam. :)

> like me to do? Perhaps reproducing these results on my setup would be
> useful?

That would be very helpful, thank you.

>
>> Look at reclaim effectivness. Patched rsync enhances x2 about reclaim
>> effectiveness and compared to mmotm-12-03, mmotm-12-03-fadvise enhances
>> 3 minute about elapsed time in stress environment.
>> I think it's due to reduce scanning, reclaim overhead.
>>
> Good good. This looks quite promising.

Thanks, Ben.

>
>> In no-stress enviroment, fadivse makes program little bit slow.
>> I think because there are many pgfault. I don't know why it happens.
>> Could you guess why it happens?
>>
> Hmm, nothing comes to mind. As I've said in the past, rsync should
> require each page only once. Perhaps perf might offer some insight into
> where this time is being spent?

Maybe. I plan to look into that.

>
> - Ben
>



--
Kind regards,
Minchan Kim