2010-11-29 15:23:37

by Minchan Kim

Subject: [PATCH v3 0/3] f/madvise(DONTNEED) support

Recently there was a report about working set pages being evicted by an
rsync workload. Application programmers would like to use fadvise for this,
but it is not easy to use. See the detailed description in [1/3].
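
For reference, below is a minimal userspace sketch of the call this series
is aimed at; the file name and buffer size are placeholders, not taken from
the patches:

/* Use-once read followed by POSIX_FADV_DONTNEED, roughly what an
 * rsync-like backup job would want to do. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("backup.dat", O_RDONLY);

	if (fd < 0)
		return 1;
	while (read(fd, buf, sizeof(buf)) > 0)
		;	/* stream through the file once */
	/* Hint that the cached pages will not be reused. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return 0;
}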

- [1/3] moves an invalidated page that is dirty or under writeback from
the active list to the head of the inactive list.
- [2/3] moves an invalidated page to the tail of the inactive list once
its writeout completes.
- [3/3] avoids calling mark_page_accessed in the madvise(DONTNEED) case.

Minchan Kim (3):
deactivate invalidated pages
Reclaim invalidated page ASAP
Prevent activation of page in madvise_dontneed

include/linux/mm.h | 4 +-
include/linux/swap.h | 1 +
mm/madvise.c | 4 +-
mm/memory.c | 38 +++++++++++-------
mm/mmap.c | 4 +-
mm/page-writeback.c | 12 +++++-
mm/swap.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 16 ++++++--
8 files changed, 155 insertions(+), 26 deletions(-)


2010-11-29 15:23:43

by Minchan Kim

Subject: [PATCH v3 1/3] deactivate invalidated pages

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It is triggered by backup workloads (e.g. a nightly rsync).
Such a workload creates only use-once pages but touches them twice,
which promotes them to the active list and, in turn, evicts the
real working set.

Some application developers would like POSIX_FADV_NOREUSE to be
supported, but other OSes don't implement it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED,
but it has a problem: if the kernel finds a page that is being written
out during invalidate_mapping_pages, it cannot drop that page.
This makes it very hard for application programmers to use, because
they always have to sync the data before calling
fadvise(..., POSIX_FADV_DONTNEED) to make sure the pages are
discardable. In the end they cannot benefit from the kernel's deferred
writeback, so they may see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
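
As a rough sketch of the workaround described above (the helper name is
made up for illustration, not part of the patches): the data has to be
pushed out before POSIX_FADV_DONTNEED can drop the pages, which gives up
the kernel's deferred writeback.

#define _GNU_SOURCE	/* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

static void write_then_drop(int fd, const void *buf, size_t len, off_t off)
{
	pwrite(fd, buf, len, off);
	/* Without this sync the pages stay dirty and DONTNEED cannot drop them. */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}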

In fact, invalidation is a big hint to the reclaimer:
it means we don't use the page any more. So let's move a page
that is still being written to the head of the inactive list.

Why the head rather than the tail: a dirty/writeback page will be flushed
sooner or later anyway, and keeping it off the tail avoids pageout()'s
single-page writeout, which is much less effective than the flusher's writeout.

Originally, I reused Peter's lru_demote with some changes, so I added
his Signed-off-by.

Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>

Andrew, before applying this series, please drop the two patches below.
mm-deactivate-invalidated-pages.patch
mm-deactivate-invalidated-pages-fix.patch

Changelog since v2:
- leave mapped pages alone - suggested by Mel
- move the PG_reclaim-related part to the next patch.

Changelog since v1:
- modify description
- correct typo
- add some comment
---
include/linux/swap.h | 1 +
mm/swap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 16 +++++++--
3 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..84375e4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
+extern void lru_deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 3f48542..19e0812 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -31,6 +31,7 @@
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
#include <linux/gfp.h>
+#include <linux/rmap.h>

#include "internal.h"

@@ -39,6 +40,8 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+

/*
* This path almost never happens for VM activity - pages are normally
@@ -267,6 +270,63 @@ void add_page_to_unevictable_list(struct page *page)
}

/*
+ * This function is used by invalidate_mapping_pages.
+ * If the page can't be invalidated, this function moves the page
+ * into inative list's head. Because the VM expects the page would
+ * be writeout by flusher. The flusher's writeout is much effective
+ * than reclaimer's random writeout.
+ */
+static void __lru_deactivate(struct page *page, struct zone *zone)
+{
+ int lru, file;
+ unsigned long vm_flags;
+
+ if (!PageLRU(page) || !PageActive(page))
+ return;
+
+ /* Some processes are using the page */
+ if (page_mapped(page))
+ return;
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+}
+
+/*
+ * This function must be called with preemption disable.
+ */
+static void __pagevec_lru_deactivate(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ __lru_deactivate(page, zone);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
* disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -292,6 +352,26 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ __pagevec_lru_deactivate(pvec);
+}
+
+/*
+ * Forcefully deactivate a page.
+ * This function is used for reclaiming the page ASAP when the page
+ * can't be invalidated by Dirty/Writeback.
+ */
+void lru_deactivate_page(struct page *page)
+{
+ if (likely(get_page_unless_zero(page))) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ if (!pagevec_add(pvec, page))
+ __pagevec_lru_deactivate(pvec);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
}

void lru_add_drain(void)
diff --git a/mm/truncate.c b/mm/truncate.c
index cd94607..09b9748 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
{
struct pagevec pvec;
pgoff_t next = start;
- unsigned long ret = 0;
+ unsigned long ret;
+ unsigned long count = 0;
int i;

pagevec_init(&pvec, 0);
@@ -359,8 +360,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (lock_failed)
continue;

- ret += invalidate_inode_page(page);
-
+ ret = invalidate_inode_page(page);
+ /*
+ * If the page was dirty or under writeback we cannot
+ * invalidate it now. Move it to the head of the
+ * inactive LRU for using deferred writeback of flusher.
+ */
+ if (!ret)
+ lru_deactivate_page(page);
+ count += ret;
unlock_page(page);
if (next > end)
break;
@@ -369,7 +377,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
mem_cgroup_uncharge_end();
cond_resched();
}
- return ret;
+ return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);

--
1.7.0.4

2010-11-29 15:23:50

by Minchan Kim

Subject: [PATCH 2/3] Reclaim invalidated page ASAP

invalidate_mapping_pages is a big hint to the reclaimer:
it means the user doesn't want to use the page any more.
So, in order to prevent working set eviction, this patch moves
such a page to the tail of the inactive list by setting PG_reclaim.

Please remember that pages on the inactive list are part of the working
set just like those on the active list. If we don't move invalidated
pages to the inactive list's tail, pages near the tail can be evicted
even though we have a strong hint about which pages are useless.
That is clearly the wrong outcome.

Currently PG_readahead and PG_reclaim share one flag bit.
Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
prevent readahead marker pages from being reclaimed too quickly.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that an invalidated page is dirty, it sets PG_reclaim
so the page is reclaimed as soon as possible. But when the dirty page
is later written back, clear_page_dirty_for_io would clear PG_reclaim
unconditionally, which defeats this series' goal.

I think it's okay to clear PG_readahead when the page is dirtied rather
than at writeback time, so this patch moves the ClearPageReclaim call
(which clears the shared bit) from clear_page_dirty_for_io to set_page_dirty.
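
To make the intended mechanism concrete, here is a rough sketch, paraphrased
from memory of the existing end_page_writeback()/rotate_reclaimable_page()
path of this era and not part of this diff, of how a PG_reclaim page reaches
the tail of the inactive list once its writeout finishes:

/* Roughly what already happens at writeback completion; this series only
 * arranges for PG_reclaim to be set on invalidated dirty pages so that
 * this path rotates them to the tail of the inactive list. */
void end_page_writeback_sketch(struct page *page)
{
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);	/* move to inactive tail */

	if (!test_clear_page_writeback(page))
		BUG();
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}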

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>

Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.

Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ /*
+ * readahead/lru_deactivate_page could remain
+ * PG_readahead/PG_reclaim due to race with end_page_writeback
+ * About readahead, if the page is written, the flags would be
+ * reset. So no problem.
+ * About lru_deactivate_page, if the page is redirty, the flag
+ * will be reset. So no problem. but if the page is used by readahead
+ * it will confuse readahead and make it restart the size rampup
+ * process. But it's a trivial problem.
+ */
+ ClearPageReclaim(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
@@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 19e0812..936b281 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -275,28 +275,50 @@ void add_page_to_unevictable_list(struct page *page)
* into inative list's head. Because the VM expects the page would
* be writeout by flusher. The flusher's writeout is much effective
* than reclaimer's random writeout.
+ *
+ * If the page isn't page_mapped and dirty/writeback, the page
+ * could reclaim asap using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. Others -> none
+ *
+ * In 4, why it moves inactive's head, the VM expects the page would
+ * be writeout by flusher. The flusher's writeout is much effective than
+ * reclaimer's random writeout.
*/
static void __lru_deactivate(struct page *page, struct zone *zone)
{
int lru, file;
- unsigned long vm_flags;
+ int active = 0;

- if (!PageLRU(page) || !PageActive(page))
+ if (!PageLRU(page))
return;
-
/* Some processes are using the page */
if (page_mapped(page))
return;
-
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
- ClearPageActive(page);
- ClearPageReferenced(page);
- add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);
-
- update_page_reclaim_stat(zone, page, file, 0);
+ if (PageActive(page))
+ active = 1;
+
+ if (PageWriteback(page) || PageDirty(page)) {
+ /*
+ * PG_reclaim could be raced with end_page_writeback
+ * It can make readahead confusing. But race window
+ * is _really_ small and it's non-critical problem.
+ */
+ SetPageReclaim(page);
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + active);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+ update_page_reclaim_stat(zone, page, file, 0);
+ }
}

/*
--
1.7.0.4

2010-11-29 15:23:56

by Minchan Kim

Subject: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

Currently zap_pte_range always activates pages that are pte_young &&
!VM_SequentialReadHint(vma). But when it is called via MADV_DONTNEED,
that is unnecessary since the pages will not be used any more.
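
For illustration, here is a minimal userspace sketch of the case this patch
targets (path and length are placeholders): a file-backed mapping that is
read once and then torn down with MADV_DONTNEED.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void read_once_and_drop(const char *path, size_t len)
{
	int fd = open(path, O_RDONLY);
	volatile char sum = 0;
	char *map;
	size_t i;

	if (fd < 0)
		return;
	map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map != MAP_FAILED) {
		for (i = 0; i < len; i++)	/* touch: ptes become young */
			sum += map[i];
		/* zap_pte_range() runs under this call; with this patch it no
		 * longer calls mark_page_accessed() on these use-once pages. */
		madvise(map, len, MADV_DONTNEED);
		munmap(map, len);
	}
	close(fd);
}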

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>

Changelog since v2:
- remove unnecessary description
Changelog since v1:
- change word from promote to activate
- add activate argument to zap_pte_range and family function

---
include/linux/mm.h | 4 ++--
mm/madvise.c | 4 ++--
mm/memory.c | 38 +++++++++++++++++++++++---------------
mm/mmap.c | 4 ++--
4 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e097df6..6032881 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *);
+ unsigned long size, struct zap_details *, bool activate);
unsigned long unmap_vmas(struct mmu_gather **tlb,
struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *);
+ struct zap_details *, bool activate);

/**
* mm_walk - callbacks for walk_page_range
diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..8bc4b2d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -171,9 +171,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
- zap_page_range(vma, start, end - start, &details);
+ zap_page_range(vma, start, end - start, &details, false);
} else
- zap_page_range(vma, start, end - start, NULL);
+ zap_page_range(vma, start, end - start, NULL, false);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 2c989f3..b8d38bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -891,7 +891,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
struct mm_struct *mm = tlb->mm;
pte_t *pte;
@@ -949,7 +950,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
- likely(!VM_SequentialReadHint(vma)))
+ likely(!VM_SequentialReadHint(vma)) &&
+ activate)
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
@@ -989,7 +991,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pmd_t *pmd;
unsigned long next;
@@ -1002,7 +1005,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
continue;
}
next = zap_pte_range(tlb, vma, pmd, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pmd++, addr = next, (addr != end && *zap_work > 0));

return addr;
@@ -1011,7 +1014,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pud_t *pud;
unsigned long next;
@@ -1024,7 +1028,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
continue;
}
next = zap_pmd_range(tlb, vma, pud, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pud++, addr = next, (addr != end && *zap_work > 0));

return addr;
@@ -1033,7 +1037,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
static unsigned long unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pgd_t *pgd;
unsigned long next;
@@ -1052,7 +1057,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
continue;
}
next = zap_pud_range(tlb, vma, pgd, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
mem_cgroup_uncharge_end();
@@ -1075,6 +1080,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
* @end_addr: virtual address at which to end unmapping
* @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
* @details: details of nonlinear truncation or shared cache invalidation
+ * @activate: whether pages included in the vma should be activated or not
*
* Returns the end address of the unmapping (restart addr if interrupted).
*
@@ -1096,7 +1102,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
unsigned long unmap_vmas(struct mmu_gather **tlbp,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *details)
+ struct zap_details *details, bool activate)
{
long zap_work = ZAP_BLOCK_SIZE;
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
@@ -1149,8 +1155,8 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,

start = end;
} else
- start = unmap_page_range(*tlbp, vma,
- start, end, &zap_work, details);
+ start = unmap_page_range(*tlbp, vma, start,
+ end, &zap_work, details, activate);

if (zap_work > 0) {
BUG_ON(start != end);
@@ -1184,9 +1190,10 @@ out:
* @address: starting address of pages to zap
* @size: number of bytes to zap
* @details: details of nonlinear truncation or shared cache invalidation
+ * @activate: whether pages included in the vma should be activated or not
*/
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *details)
+ unsigned long size, struct zap_details *details, bool activate)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather *tlb;
@@ -1196,7 +1203,8 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+ end = unmap_vmas(&tlb, vma, address, end, &nr_accounted,
+ details, activate);
if (tlb)
tlb_finish_mmu(tlb, address, end);
return end;
@@ -1220,7 +1228,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
if (address < vma->vm_start || address + size > vma->vm_end ||
!(vma->vm_flags & VM_PFNMAP))
return -1;
- zap_page_range(vma, address, size, NULL);
+ zap_page_range(vma, address, size, NULL, false);
return 0;
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);
@@ -2481,7 +2489,7 @@ again:
}

restart_addr = zap_page_range(vma, start_addr,
- end_addr - start_addr, details);
+ end_addr - start_addr, details, true);
need_break = need_resched() || spin_needbreak(details->i_mmap_lock);

if (restart_addr >= end_addr) {
diff --git a/mm/mmap.c b/mm/mmap.c
index b179abb..0ed5ab3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1904,7 +1904,7 @@ static void unmap_region(struct mm_struct *mm,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL, true);
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
@@ -2278,7 +2278,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL, true);
vm_unacct_memory(nr_accounted);

free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
--
1.7.0.4

2010-11-29 16:57:24

by Mel Gorman

Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 12:23:20AM +0900, Minchan Kim wrote:
> invalidate_mapping_pages is very big hint to reclaimer.
> It means user doesn't want to use the page any more.
> So in order to prevent working set page eviction, this patch
> move the page into tail of inactive list by PG_reclaim.
>
> Please, remember that pages in inactive list are working set
> as well as active list. If we don't move pages into inactive list's
> tail, pages near by tail of inactive list can be evicted although
> we have a big clue about useless pages. It's totally bad.
>
> Now PG_readahead/PG_reclaim is shared.
> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> preventing fast reclaiming readahead marker page.
>
> In this series, PG_reclaim is used by invalidated page, too.
> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> to reclaim asap. Then, when the dirty page will be writeback,
> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> It disturbs this serie's goal.
>
> I think it's okay to clear PG_readahead when the page is dirty, not
> writeback time. So this patch moves ClearPageReadahead.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
>
> Changelog since v2:
> - put ClearPageReclaim in set_page_dirty - suggested by Wu.
>
> Changelog since v1:
> - make the invalidated page reclaim asap - suggested by Andrew.
> ---
> mm/page-writeback.c | 12 +++++++++++-
> mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 46 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index fc93802..88587a5 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
>
> + /*
> + * readahead/lru_deactivate_page could remain
> + * PG_readahead/PG_reclaim due to race with end_page_writeback
> + * About readahead, if the page is written, the flags would be
> + * reset. So no problem.
> + * About lru_deactivate_page, if the page is redirty, the flag
> + * will be reset. So no problem. but if the page is used by readahead
> + * it will confuse readahead and make it restart the size rampup
> + * process. But it's a trivial problem.
> + */
> + ClearPageReclaim(page);
> if (likely(mapping)) {
> int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> #ifdef CONFIG_BLOCK
> @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
>
> BUG_ON(!PageLocked(page));
>
> - ClearPageReclaim(page);
> if (mapping && mapping_cap_account_dirty(mapping)) {
> /*
> * Yes, Virginia, this is indeed insane.
> diff --git a/mm/swap.c b/mm/swap.c
> index 19e0812..936b281 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -275,28 +275,50 @@ void add_page_to_unevictable_list(struct page *page)
> * into inative list's head. Because the VM expects the page would
> * be writeout by flusher. The flusher's writeout is much effective
> * than reclaimer's random writeout.
> + *
> + * If the page isn't page_mapped and dirty/writeback, the page
> + * could reclaim asap using PG_reclaim.
> + *
> + * 1. active, mapped page -> none
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. Others -> none
> + *
> + * In 4, why it moves inactive's head, the VM expects the page would
> + * be writeout by flusher. The flusher's writeout is much effective than
> + * reclaimer's random writeout.
> */
> static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> int lru, file;
> - unsigned long vm_flags;
> + int active = 0;
>
> - if (!PageLRU(page) || !PageActive(page))
> + if (!PageLRU(page))
> return;
> -
> /* Some processes are using the page */
> if (page_mapped(page))
> return;
> -
> - file = page_is_file_cache(page);
> - lru = page_lru_base_type(page);
> - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> - ClearPageActive(page);
> - ClearPageReferenced(page);
> - add_page_to_lru_list(zone, page, lru);
> - __count_vm_event(PGDEACTIVATE);
> -
> - update_page_reclaim_stat(zone, page, file, 0);
> + if (PageActive(page))
> + active = 1;
> +
> + if (PageWriteback(page) || PageDirty(page)) {
> + /*
> + * PG_reclaim could be raced with end_page_writeback
> + * It can make readahead confusing. But race window
> + * is _really_ small and it's non-critical problem.
> + */
> + SetPageReclaim(page);
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + active);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + __count_vm_event(PGDEACTIVATE);

You update PGDEACTIVATE whether the page was active or not.

> + update_page_reclaim_stat(zone, page, file, 0);
> + }
> }
>
> /*
> --
> 1.7.0.4
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-11-29 22:42:07

by Minchan Kim

Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Mon, Nov 29, 2010 at 04:57:06PM +0000, Mel Gorman wrote:
> On Tue, Nov 30, 2010 at 12:23:20AM +0900, Minchan Kim wrote:
> > invalidate_mapping_pages is very big hint to reclaimer.
> > It means user doesn't want to use the page any more.
> > So in order to prevent working set page eviction, this patch
> > move the page into tail of inactive list by PG_reclaim.
> >
> > Please, remember that pages in inactive list are working set
> > as well as active list. If we don't move pages into inactive list's
> > tail, pages near by tail of inactive list can be evicted although
> > we have a big clue about useless pages. It's totally bad.
> >
> > Now PG_readahead/PG_reclaim is shared.
> > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > preventing fast reclaiming readahead marker page.
> >
> > In this series, PG_reclaim is used by invalidated page, too.
> > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > to reclaim asap. Then, when the dirty page will be writeback,
> > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > It disturbs this serie's goal.
> >
> > I think it's okay to clear PG_readahead when the page is dirty, not
> > writeback time. So this patch moves ClearPageReadahead.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > Acked-by: Rik van Riel <[email protected]>
> > Cc: Wu Fengguang <[email protected]>
> > Cc: KOSAKI Motohiro <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Nick Piggin <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> >
> > Changelog since v2:
> > - put ClearPageReclaim in set_page_dirty - suggested by Wu.
> >
> > Changelog since v1:
> > - make the invalidated page reclaim asap - suggested by Andrew.
> > ---
> > mm/page-writeback.c | 12 +++++++++++-
> > mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
> > 2 files changed, 46 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index fc93802..88587a5 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> > {
> > struct address_space *mapping = page_mapping(page);
> >
> > + /*
> > + * readahead/lru_deactivate_page could remain
> > + * PG_readahead/PG_reclaim due to race with end_page_writeback
> > + * About readahead, if the page is written, the flags would be
> > + * reset. So no problem.
> > + * About lru_deactivate_page, if the page is redirty, the flag
> > + * will be reset. So no problem. but if the page is used by readahead
> > + * it will confuse readahead and make it restart the size rampup
> > + * process. But it's a trivial problem.
> > + */
> > + ClearPageReclaim(page);
> > if (likely(mapping)) {
> > int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> > #ifdef CONFIG_BLOCK
> > @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
> >
> > BUG_ON(!PageLocked(page));
> >
> > - ClearPageReclaim(page);
> > if (mapping && mapping_cap_account_dirty(mapping)) {
> > /*
> > * Yes, Virginia, this is indeed insane.
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 19e0812..936b281 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -275,28 +275,50 @@ void add_page_to_unevictable_list(struct page *page)
> > * into inative list's head. Because the VM expects the page would
> > * be writeout by flusher. The flusher's writeout is much effective
> > * than reclaimer's random writeout.
> > + *
> > + * If the page isn't page_mapped and dirty/writeback, the page
> > + * could reclaim asap using PG_reclaim.
> > + *
> > + * 1. active, mapped page -> none
> > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 3. inactive, mapped page -> none
> > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 5. Others -> none
> > + *
> > + * In 4, why it moves inactive's head, the VM expects the page would
> > + * be writeout by flusher. The flusher's writeout is much effective than
> > + * reclaimer's random writeout.
> > */
> > static void __lru_deactivate(struct page *page, struct zone *zone)
> > {
> > int lru, file;
> > - unsigned long vm_flags;
> > + int active = 0;
> >
> > - if (!PageLRU(page) || !PageActive(page))
> > + if (!PageLRU(page))
> > return;
> > -
> > /* Some processes are using the page */
> > if (page_mapped(page))
> > return;
> > -
> > - file = page_is_file_cache(page);
> > - lru = page_lru_base_type(page);
> > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > - ClearPageActive(page);
> > - ClearPageReferenced(page);
> > - add_page_to_lru_list(zone, page, lru);
> > - __count_vm_event(PGDEACTIVATE);
> > -
> > - update_page_reclaim_stat(zone, page, file, 0);
> > + if (PageActive(page))
> > + active = 1;
> > +
> > + if (PageWriteback(page) || PageDirty(page)) {
> > + /*
> > + * PG_reclaim could be raced with end_page_writeback
> > + * It can make readahead confusing. But race window
> > + * is _really_ small and it's non-critical problem.
> > + */
> > + SetPageReclaim(page);
> > +
> > + file = page_is_file_cache(page);
> > + lru = page_lru_base_type(page);
> > + del_page_from_lru_list(zone, page, lru + active);
> > + ClearPageActive(page);
> > + ClearPageReferenced(page);
> > + add_page_to_lru_list(zone, page, lru);
> > + __count_vm_event(PGDEACTIVATE);
>
> You update PGDEACTIVATE whether the page was active or not.

My fault.
Resend.

Subject: [PATCH v3 2/3] Reclaim invalidated page ASAP

invalidate_mapping_pages is a big hint to the reclaimer:
it means the user doesn't want to use the page any more.
So, in order to prevent working set eviction, this patch moves
such a page to the tail of the inactive list by setting PG_reclaim.

Please remember that pages on the inactive list are part of the working
set just like those on the active list. If we don't move invalidated
pages to the inactive list's tail, pages near the tail can be evicted
even though we have a strong hint about which pages are useless.
That is clearly the wrong outcome.

Currently PG_readahead and PG_reclaim share one flag bit.
Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
prevent readahead marker pages from being reclaimed too quickly.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that an invalidated page is dirty, it sets PG_reclaim
so the page is reclaimed as soon as possible. But when the dirty page
is later written back, clear_page_dirty_for_io would clear PG_reclaim
unconditionally, which defeats this series' goal.

I think it's okay to clear PG_readahead when the page is dirtied rather
than at writeback time, so this patch moves the ClearPageReclaim call
(which clears the shared bit) from clear_page_dirty_for_io to set_page_dirty.

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>

Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.

Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 49 ++++++++++++++++++++++++++++++++++++-------------
2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ /*
+ * readahead/lru_deactivate_page could remain
+ * PG_readahead/PG_reclaim due to race with end_page_writeback
+ * About readahead, if the page is written, the flags would be
+ * reset. So no problem.
+ * About lru_deactivate_page, if the page is redirty, the flag
+ * will be reset. So no problem. but if the page is used by readahead
+ * it will confuse readahead and make it restart the size rampup
+ * process. But it's a trivial problem.
+ */
+ ClearPageReclaim(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
@@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 19e0812..1f1f435 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -275,28 +275,51 @@ void add_page_to_unevictable_list(struct page *page)
* into inative list's head. Because the VM expects the page would
* be writeout by flusher. The flusher's writeout is much effective
* than reclaimer's random writeout.
+ *
+ * If the page isn't page_mapped and dirty/writeback, the page
+ * could reclaim asap using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. Others -> none
+ *
+ * In 4, why it moves inactive's head, the VM expects the page would
+ * be writeout by flusher. The flusher's writeout is much effective than
+ * reclaimer's random writeout.
*/
static void __lru_deactivate(struct page *page, struct zone *zone)
{
int lru, file;
- unsigned long vm_flags;
+ int active = 0;

- if (!PageLRU(page) || !PageActive(page))
+ if (!PageLRU(page))
return;
-
/* Some processes are using the page */
if (page_mapped(page))
return;
-
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
- ClearPageActive(page);
- ClearPageReferenced(page);
- add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);
-
- update_page_reclaim_stat(zone, page, file, 0);
+ if (PageActive(page))
+ active = 1;
+
+ if (PageWriteback(page) || PageDirty(page)) {
+ /*
+ * PG_reclaim could be raced with end_page_writeback
+ * It can make readahead confusing. But race window
+ * is _really_ small and it's non-critical problem.
+ */
+ SetPageReclaim(page);
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + active);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (active)
+ __count_vm_event(PGDEACTIVATE);
+ update_page_reclaim_stat(zone, page, file, 0);
+ }
}

/*
--
1.7.0.4

2010-11-30 01:01:40

by KOSAKI Motohiro

Subject: Re: [PATCH v3 1/3] deactivate invalidated pages

> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.
>
> Why I need the page to head, Dirty/Writeback page would be flushed
> sooner or later. It can prevent writeout of pageout which is less
> effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>


2010-11-30 01:08:08

by KOSAKI Motohiro

Subject: Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

> Now zap_pte_range alwayas activates pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2010-11-30 01:10:26

by KOSAKI Motohiro

Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

> invalidate_mapping_pages is very big hint to reclaimer.
> It means user doesn't want to use the page any more.
> So in order to prevent working set page eviction, this patch
> move the page into tail of inactive list by PG_reclaim.
>
> Please, remember that pages in inactive list are working set
> as well as active list. If we don't move pages into inactive list's
> tail, pages near by tail of inactive list can be evicted although
> we have a big clue about useless pages. It's totally bad.
>
> Now PG_readahead/PG_reclaim is shared.
> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> preventing fast reclaiming readahead marker page.
>
> In this series, PG_reclaim is used by invalidated page, too.
> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> to reclaim asap. Then, when the dirty page will be writeback,
> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> It disturbs this serie's goal.
>
> I think it's okay to clear PG_readahead when the page is dirty, not
> writeback time. So this patch moves ClearPageReadahead.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>

I still dislike this one. I doubt this trick brings much benefit in
real-world workloads.


2010-11-30 05:22:18

by Johannes Weiner

Subject: Re: [PATCH v3 1/3] deactivate invalidated pages

On Tue, Nov 30, 2010 at 12:23:19AM +0900, Minchan Kim wrote:
> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.
>
> Why I need the page to head, Dirty/Writeback page would be flushed
> sooner or later. It can prevent writeout of pageout which is less
> effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
>
> Adnrew. Before applying this series, please drop below two patches.
> mm-deactivate-invalidated-pages.patch
> mm-deactivate-invalidated-pages-fix.patch
>
> Changelog since v2:
> - mapped page leaves alone - suggested by Mel
> - pass part related PG_reclaim in next patch.
>
> Changelog since v1:
> - modify description
> - correct typo
> - add some comment
> ---
> include/linux/swap.h | 1 +
> mm/swap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 16 +++++++--
> 3 files changed, 93 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..84375e4 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h

[...]

> @@ -267,6 +270,63 @@ void add_page_to_unevictable_list(struct page *page)
> }
>
> /*
> + * This function is used by invalidate_mapping_pages.
> + * If the page can't be invalidated, this function moves the page
> + * into inative list's head. Because the VM expects the page would
> + * be writeout by flusher. The flusher's writeout is much effective
> + * than reclaimer's random writeout.

The wording is a bit confusing, I find. It sounds a bit like the
flusher's chance is increased by moving it to the inactive list in the
first place, but the key is that it is moved to the head instead of,
what one would expect, the tail of the list. How about:

If the page can not be invalidated, it is moved to the
inactive list to speed up its reclaim. It is moved to the
head of the list, rather than the tail, to give the flusher
threads some time to write it out, as this is much more
effective than the single-page writeout from reclaim.

> +static void __lru_deactivate(struct page *page, struct zone *zone)

Do you insist on the underscores? :)

> +{
> + int lru, file;
> + unsigned long vm_flags;
> +
> + if (!PageLRU(page) || !PageActive(page))
> + return;
> +
> + /* Some processes are using the page */
> + if (page_mapped(page))
> + return;
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + __count_vm_event(PGDEACTIVATE);
> +
> + update_page_reclaim_stat(zone, page, file, 0);
> +}
> +
> +/*
> + * This function must be called with preemption disable.

Why is that? Unless I missed something, the only thing that needs
protection is the per-cpu pagevec reference the only user of this
function passes in. But this should be the caller's concern and is
not really a requirement of this function per-se, is it?

> +static void __pagevec_lru_deactivate(struct pagevec *pvec)

More underscores!

> +{
> + int i;
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> + __lru_deactivate(page, zone);
> + }
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> +
> + release_pages(pvec->pages, pvec->nr, pvec->cold);
> + pagevec_reinit(pvec);
> +}
> +
> +/*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> * disabled; or "cpu" is being hot-unplugged, and is already dead.
> @@ -292,6 +352,26 @@ static void drain_cpu_pagevecs(int cpu)
> pagevec_move_tail(pvec);
> local_irq_restore(flags);
> }
> +
> + pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> + if (pagevec_count(pvec))
> + __pagevec_lru_deactivate(pvec);
> +}
> +
> +/*
> + * Forcefully deactivate a page.
> + * This function is used for reclaiming the page ASAP when the page
> + * can't be invalidated by Dirty/Writeback.

How about:

/**
* lru_deactivate_page - forcefully deactivate a page
* @page: page to deactivate
*
* This function hints the VM that @page is a good reclaim candidate,
* for example if its invalidation fails due to the page being dirty
* or under writeback.
*/

> +void lru_deactivate_page(struct page *page)

I would love that lru_ prefix for most of the API in this file. In
fact, the file should probably be called lru.c. But for now, can you
keep the naming consistent and call it deactivate_page?

> + if (likely(get_page_unless_zero(page))) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> + if (!pagevec_add(pvec, page))
> + __pagevec_lru_deactivate(pvec);
> + put_cpu_var(lru_deactivate_pvecs);
> + }
> }
>
> void lru_add_drain(void)

[...]

> @@ -359,8 +360,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + /*
> + * If the page was dirty or under writeback we cannot
> + * invalidate it now. Move it to the head of the
> + * inactive LRU for using deferred writeback of flusher.

This would also be less confusing if it would say

Move it to the head of the inactive LRU (rather than the tail)
for using [...]

But I am not sure that this detail is interesting at this point. It
would be more interesting to name the reasons for why the page is
moved to the inactive list in the first place:

If the page is dirty or under writeback, we can not invalidate
it now. But we assume that attempted invalidation is a hint
that the page is no longer of interest and try to speed up its
reclaim.

> + */
> + if (!ret)
> + lru_deactivate_page(page);
> + count += ret;
> unlock_page(page);
> if (next > end)
> break;

Hannes

2010-11-30 06:18:44

by Minchan Kim

Subject: Re: [PATCH v3 1/3] deactivate invalidated pages

Hi Hannes,

On Tue, Nov 30, 2010 at 2:22 PM, Johannes Weiner <[email protected]> wrote:
> On Tue, Nov 30, 2010 at 12:23:19AM +0900, Minchan Kim wrote:
>> Recently, there are reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>> active list so that it results in working set page eviction.
>>
>> Some app developer want to support POSIX_FADV_NOREUSE.
>> But other OSes don't support it, either.
>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>
>> By other approach, app developers use POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It is very hard for application programmer to use it.
>> Because they always have to sync data before calling
>> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> be discardable. At last, they can't use deferred write of kernel
>> so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>>
>> In fact, invalidation is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head.
>>
>> Why I need the page to head, Dirty/Writeback page would be flushed
>> sooner or later. It can prevent writeout of pageout which is less
>> effective than flusher's writeout.
>>
>> Originally, I reused lru_demote of Peter with some change so added
>> his Signed-off-by.
>>
>> Reported-by: Ben Gamari <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>>
>> Adnrew. Before applying this series, please drop below two patches.
>>  mm-deactivate-invalidated-pages.patch
>>  mm-deactivate-invalidated-pages-fix.patch
>>
>> Changelog since v2:
>>  - mapped page leaves alone - suggested by Mel
>>  - pass part related PG_reclaim in next patch.
>>
>> Changelog since v1:
>>  - modify description
>>  - correct typo
>>  - add some comment
>> ---
>>  include/linux/swap.h |    1 +
>>  mm/swap.c            |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  mm/truncate.c        |   16 +++++++--
>>  3 files changed, 93 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..84375e4 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>
> [...]
>
>> @@ -267,6 +270,63 @@ void add_page_to_unevictable_list(struct page *page)
>>  }
>>
>>  /*
>> + * This function is used by invalidate_mapping_pages.
>> + * If the page can't be invalidated, this function moves the page
>> + * into inative list's head. Because the VM expects the page would
>> + * be writeout by flusher. The flusher's writeout is much effective
>> + * than reclaimer's random writeout.
>
> The wording is a bit confusing, I find.  It sounds a bit like the
> flusher's chance is increased by moving it to the inactive list in the
> first place, but the key is that it is moved to the head instead of,
> what one would expect, the tail of the list.  How about:
>
>        If the page can not be invalidated, it is moved to the
>        inactive list to speed up its reclaim.  It is moved to the
>        head of the list, rather than the tail, to give the flusher
>        threads some time to write it out, as this is much more
>        effective than the single-page writeout from reclaim.
>

Looks good to me.
I will add your comment instead of my ugly comment.

>> +static void __lru_deactivate(struct page *page, struct zone *zone)
>
> Do you insist on the underscores? :)

Good point.

__lru_deactivate is self-contained.
It is useful enough to be called from other places.
I will remove the underscores.

>
>> +{
>> +	int lru, file;
>> +	unsigned long vm_flags;
>> +
>> +	if (!PageLRU(page) || !PageActive(page))
>> +		return;
>> +
>> +	/* Some processes are using the page */
>> +	if (page_mapped(page))
>> +		return;
>> +
>> +	file = page_is_file_cache(page);
>> +	lru = page_lru_base_type(page);
>> +	del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
>> +	ClearPageActive(page);
>> +	ClearPageReferenced(page);
>> +	add_page_to_lru_list(zone, page, lru);
>> +	__count_vm_event(PGDEACTIVATE);
>> +
>> +	update_page_reclaim_stat(zone, page, file, 0);
>> +}
>> +
>> +/*
>> + * This function must be called with preemption disable.
>
> Why is that?  Unless I missed something, the only thing that needs
> protection is the per-cpu pagevec reference the only user of this
> function passes in.  But this should be the caller's concern and is
> not really a requirement of this function per-se, is it?

Yes, it's unnecessary.
I didn't think about it enough.
Will fix.

>
>> +static void __pagevec_lru_deactivate(struct pagevec *pvec)
>
> More underscores!

Will fix.

>
>> +{
>> +	int i;
>> +	struct zone *zone = NULL;
>> +
>> +	for (i = 0; i < pagevec_count(pvec); i++) {
>> +		struct page *page = pvec->pages[i];
>> +		struct zone *pagezone = page_zone(page);
>> +
>> +		if (pagezone != zone) {
>> +			if (zone)
>> +				spin_unlock_irq(&zone->lru_lock);
>> +			zone = pagezone;
>> +			spin_lock_irq(&zone->lru_lock);
>> +		}
>> +		__lru_deactivate(page, zone);
>> +	}
>> +	if (zone)
>> +		spin_unlock_irq(&zone->lru_lock);
>> +
>> +	release_pages(pvec->pages, pvec->nr, pvec->cold);
>> +	pagevec_reinit(pvec);
>> +}
>> +
>> +/*
>>  * Drain pages out of the cpu's pagevecs.
>>  * Either "cpu" is the current CPU, and preemption has already been
>>  * disabled; or "cpu" is being hot-unplugged, and is already dead.
>> @@ -292,6 +352,26 @@ static void drain_cpu_pagevecs(int cpu)
>> 		pagevec_move_tail(pvec);
>> 		local_irq_restore(flags);
>> 	}
>> +
>> +	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
>> +	if (pagevec_count(pvec))
>> +		__pagevec_lru_deactivate(pvec);
>> +}
>> +
>> +/*
>> + * Forcefully deactivate a page.
>> + * This function is used for reclaiming the page ASAP when the page
>> + * can't be invalidated by Dirty/Writeback.
>
> How about:
>
> /**
>  * lru_deactivate_page - forcefully deactivate a page
>  * @page: page to deactivate
>  *
>  * This function hints the VM that @page is a good reclaim candidate,
>  * for example if its invalidation fails due to the page being dirty
>  * or under writeback.
>  */
>
>> +void lru_deactivate_page(struct page *page)
>
> I would love that lru_ prefix for most of the API in this file.  In
> fact, the file should probably be called lru.c.  But for now, can you
> keep the naming consistent and call it deactivate_page?

No problem, I can change it. But deactivate_page will be asymmetric
in that deactivate_page forcefully moves an active page to the inactive
list (two steps) while activate_page does a one-step activation.
That's why I named it this way.

>
>> +	if (likely(get_page_unless_zero(page))) {
>> +		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
>> +
>> +		if (!pagevec_add(pvec, page))
>> +			__pagevec_lru_deactivate(pvec);
>> +		put_cpu_var(lru_deactivate_pvecs);
>> +	}
>>  }
>>
>>  void lru_add_drain(void)
>
> [...]
>
>> @@ -359,8 +360,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> 			if (lock_failed)
>> 				continue;
>>
>> -			ret += invalidate_inode_page(page);
>> -
>> +			ret = invalidate_inode_page(page);
>> +			/*
>> +			 * If the page was dirty or under writeback we cannot
>> +			 * invalidate it now.  Move it to the head of the
>> +			 * inactive LRU for using deferred writeback of flusher.
>
> This would also be less confusing if it would say
>
> 	Move it to the head of the inactive LRU (rather than the tail)
> 	for using [...]
>
> But I am not sure that this detail is interesting at this point.  It
> would be more interesting to name the reasons for why the page is
> moved to the inactive list in the first place:
>
> 	If the page is dirty or under writeback, we can not invalidate
> 	it now.  But we assume that attempted invalidation is a hint
> 	that the page is no longer of interest and try to speed up its
> 	reclaim.
>

Will fix.
I hope to hear your opinions about [2/3], too. :)

Thanks, Hannes.

>> +			 */
>> +			if (!ret)
>> +				lru_deactivate_page(page);
>> +			count += ret;
>>  			unlock_page(page);
>>  			if (next > end)
>>  				break;
>
>        Hannes
>



--
Kind regards,
Minchan Kim

2010-11-30 09:16:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 07:41:30AM +0900, Minchan Kim wrote:
> On Mon, Nov 29, 2010 at 04:57:06PM +0000, Mel Gorman wrote:
> > On Tue, Nov 30, 2010 at 12:23:20AM +0900, Minchan Kim wrote:
> > > invalidate_mapping_pages is very big hint to reclaimer.
> > > It means user doesn't want to use the page any more.
> > > So in order to prevent working set page eviction, this patch
> > > move the page into tail of inactive list by PG_reclaim.
> > >
> > > Please, remember that pages in inactive list are working set
> > > as well as active list. If we don't move pages into inactive list's
> > > tail, pages near by tail of inactive list can be evicted although
> > > we have a big clue about useless pages. It's totally bad.
> > >
> > > Now PG_readahead/PG_reclaim is shared.
> > > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > > preventing fast reclaiming readahead marker page.
> > >
> > > In this series, PG_reclaim is used by invalidated page, too.
> > > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > > to reclaim asap. Then, when the dirty page will be writeback,
> > > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > > It disturbs this serie's goal.
> > >
> > > I think it's okay to clear PG_readahead when the page is dirty, not
> > > writeback time. So this patch moves ClearPageReadahead.
> > >
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > Acked-by: Rik van Riel <[email protected]>
> > > Cc: Wu Fengguang <[email protected]>
> > > Cc: KOSAKI Motohiro <[email protected]>
> > > Cc: Johannes Weiner <[email protected]>
> > > Cc: Nick Piggin <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > >
> > > Changelog since v2:
> > > - put ClearPageReclaim in set_page_dirty - suggested by Wu.
> > >
> > > Changelog since v1:
> > > - make the invalidated page reclaim asap - suggested by Andrew.
> > > ---
> > > mm/page-writeback.c | 12 +++++++++++-
> > > mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
> > > 2 files changed, 46 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index fc93802..88587a5 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> > > {
> > > struct address_space *mapping = page_mapping(page);
> > >
> > > + /*
> > > + * readahead/lru_deactivate_page could remain
> > > + * PG_readahead/PG_reclaim due to race with end_page_writeback
> > > + * About readahead, if the page is written, the flags would be
> > > + * reset. So no problem.
> > > + * About lru_deactivate_page, if the page is redirty, the flag
> > > + * will be reset. So no problem. but if the page is used by readahead
> > > + * it will confuse readahead and make it restart the size rampup
> > > + * process. But it's a trivial problem.
> > > + */
> > > + ClearPageReclaim(page);
> > > if (likely(mapping)) {
> > > int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> > > #ifdef CONFIG_BLOCK
> > > @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
> > >
> > > BUG_ON(!PageLocked(page));
> > >
> > > - ClearPageReclaim(page);
> > > if (mapping && mapping_cap_account_dirty(mapping)) {
> > > /*
> > > * Yes, Virginia, this is indeed insane.
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index 19e0812..936b281 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -275,28 +275,50 @@ void add_page_to_unevictable_list(struct page *page)
> > > * into inative list's head. Because the VM expects the page would
> > > * be writeout by flusher. The flusher's writeout is much effective
> > > * than reclaimer's random writeout.
> > > + *
> > > + * If the page isn't page_mapped and dirty/writeback, the page
> > > + * could reclaim asap using PG_reclaim.
> > > + *
> > > + * 1. active, mapped page -> none
> > > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > > + * 3. inactive, mapped page -> none
> > > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > > + * 5. Others -> none
> > > + *
> > > + * In 4, why it moves inactive's head, the VM expects the page would
> > > + * be writeout by flusher. The flusher's writeout is much effective than
> > > + * reclaimer's random writeout.
> > > */
> > > static void __lru_deactivate(struct page *page, struct zone *zone)
> > > {
> > > int lru, file;
> > > - unsigned long vm_flags;
> > > + int active = 0;
> > >
> > > - if (!PageLRU(page) || !PageActive(page))
> > > + if (!PageLRU(page))
> > > return;
> > > -
> > > /* Some processes are using the page */
> > > if (page_mapped(page))
> > > return;
> > > -
> > > - file = page_is_file_cache(page);
> > > - lru = page_lru_base_type(page);
> > > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > > - ClearPageActive(page);
> > > - ClearPageReferenced(page);
> > > - add_page_to_lru_list(zone, page, lru);
> > > - __count_vm_event(PGDEACTIVATE);
> > > -
> > > - update_page_reclaim_stat(zone, page, file, 0);
> > > + if (PageActive(page))
> > > + active = 1;
> > > +
> > > + if (PageWriteback(page) || PageDirty(page)) {
> > > + /*
> > > + * PG_reclaim could be raced with end_page_writeback
> > > + * It can make readahead confusing. But race window
> > > + * is _really_ small and it's non-critical problem.
> > > + */
> > > + SetPageReclaim(page);
> > > +
> > > + file = page_is_file_cache(page);
> > > + lru = page_lru_base_type(page);
> > > + del_page_from_lru_list(zone, page, lru + active);
> > > + ClearPageActive(page);
> > > + ClearPageReferenced(page);
> > > + add_page_to_lru_list(zone, page, lru);
> > > + __count_vm_event(PGDEACTIVATE);
> >
> > You update PGDEACTIVATE whether the page was active or not.
>
> My fault.
> Resend.
>
> Subject: [PATCH v3 2/3] Reclaim invalidated page ASAP
>
> invalidate_mapping_pages is very big hint to reclaimer.
> It means user doesn't want to use the page any more.
> So in order to prevent working set page eviction, this patch
> move the page into tail of inactive list by PG_reclaim.
>
> Please, remember that pages in inactive list are working set
> as well as active list. If we don't move pages into inactive list's
> tail, pages near by tail of inactive list can be evicted although
> we have a big clue about useless pages. It's totally bad.
>
> Now PG_readahead/PG_reclaim is shared.
> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> preventing fast reclaiming readahead marker page.
>
> In this series, PG_reclaim is used by invalidated page, too.
> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> to reclaim asap. Then, when the dirty page will be writeback,
> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> It disturbs this serie's goal.
>
> I think it's okay to clear PG_readahead when the page is dirty, not
> writeback time. So this patch moves ClearPageReadahead.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
>
> Changelog since v2:
> - put ClearPageReclaim in set_page_dirty - suggested by Wu.
>
> Changelog since v1:
> - make the invalidated page reclaim asap - suggested by Andrew.
> ---
> mm/page-writeback.c | 12 +++++++++++-
> mm/swap.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 47 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index fc93802..88587a5 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
>
> + /*
> + * readahead/lru_deactivate_page could remain
> + * PG_readahead/PG_reclaim due to race with end_page_writeback
> + * About readahead, if the page is written, the flags would be
> + * reset. So no problem.
> + * About lru_deactivate_page, if the page is redirty, the flag
> + * will be reset. So no problem. but if the page is used by readahead
> + * it will confuse readahead and make it restart the size rampup
> + * process. But it's a trivial problem.
> + */
> + ClearPageReclaim(page);
> if (likely(mapping)) {
> int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> #ifdef CONFIG_BLOCK
> @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
>
> BUG_ON(!PageLocked(page));
>
> - ClearPageReclaim(page);
> if (mapping && mapping_cap_account_dirty(mapping)) {
> /*
> * Yes, Virginia, this is indeed insane.
> diff --git a/mm/swap.c b/mm/swap.c
> index 19e0812..1f1f435 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -275,28 +275,51 @@ void add_page_to_unevictable_list(struct page *page)
> * into inative list's head. Because the VM expects the page would
> * be writeout by flusher. The flusher's writeout is much effective
> * than reclaimer's random writeout.
> + *
> + * If the page isn't page_mapped and dirty/writeback, the page
> + * could reclaim asap using PG_reclaim.
> + *
> + * 1. active, mapped page -> none
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. Others -> none
> + *
> + * In 4, why it moves inactive's head, the VM expects the page would
> + * be writeout by flusher. The flusher's writeout is much effective than
> + * reclaimer's random writeout.
> */
> static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> int lru, file;
> - unsigned long vm_flags;
> + int active = 0;
>
> - if (!PageLRU(page) || !PageActive(page))
> + if (!PageLRU(page))
> return;
> -
> /* Some processes are using the page */
> if (page_mapped(page))
> return;
> -
> - file = page_is_file_cache(page);
> - lru = page_lru_base_type(page);
> - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> - ClearPageActive(page);
> - ClearPageReferenced(page);
> - add_page_to_lru_list(zone, page, lru);
> - __count_vm_event(PGDEACTIVATE);
> -
> - update_page_reclaim_stat(zone, page, file, 0);
> + if (PageActive(page))
> + active = 1;
> +

I should have said this the last time, but if you do another revision,
make active a "bool". There is a very slow migration from int to bool in
cases where it makes sense. It's not urgent though.

> + if (PageWriteback(page) || PageDirty(page)) {
> + /*
> + * PG_reclaim could be raced with end_page_writeback
> + * It can make readahead confusing. But race window
> + * is _really_ small and it's non-critical problem.
> + */
> + SetPageReclaim(page);
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + active);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (active)
> + __count_vm_event(PGDEACTIVATE);
> + update_page_reclaim_stat(zone, page, file, 0);
> + }
> }
>

Whether you update active's type or not:

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-30 09:18:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 10:10:20AM +0900, KOSAKI Motohiro wrote:
> > invalidate_mapping_pages is very big hint to reclaimer.
> > It means user doesn't want to use the page any more.
> > So in order to prevent working set page eviction, this patch
> > move the page into tail of inactive list by PG_reclaim.
> >
> > Please, remember that pages in inactive list are working set
> > as well as active list. If we don't move pages into inactive list's
> > tail, pages near by tail of inactive list can be evicted although
> > we have a big clue about useless pages. It's totally bad.
> >
> > Now PG_readahead/PG_reclaim is shared.
> > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > preventing fast reclaiming readahead marker page.
> >
> > In this series, PG_reclaim is used by invalidated page, too.
> > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > to reclaim asap. Then, when the dirty page will be writeback,
> > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > It disturbs this serie's goal.
> >
> > I think it's okay to clear PG_readahead when the page is dirty, not
> > writeback time. So this patch moves ClearPageReadahead.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > Acked-by: Rik van Riel <[email protected]>
> > Cc: Wu Fengguang <[email protected]>
> > Cc: KOSAKI Motohiro <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Nick Piggin <[email protected]>
> > Cc: Mel Gorman <[email protected]>
>
> I still dislike this one. I doubt this trick makes much benefit in real
> world workload.
>

I would agree except that, as said elsewhere, it's a chicken-and-egg problem.
We don't have a real world test because fadvise is not useful in its
current iteration. I'm hoping that there will be a test comparing

rsync on vanilla kernel
rsync on patched kernel
rsync+patch on vanilla kernel
rsync+patch on patched kernel

Are the results of such a test likely to happen?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-30 11:21:00

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 07:41:30AM +0900, Minchan Kim wrote:

> diff --git a/mm/swap.c b/mm/swap.c
> index 19e0812..1f1f435 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -275,28 +275,51 @@ void add_page_to_unevictable_list(struct page *page)
> * into inative list's head. Because the VM expects the page would
> * be writeout by flusher. The flusher's writeout is much effective
> * than reclaimer's random writeout.
> + *
> + * If the page isn't page_mapped and dirty/writeback, the page
> + * could reclaim asap using PG_reclaim.
> + *
> + * 1. active, mapped page -> none
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. Others -> none
> + *
> + * In 4, why it moves inactive's head, the VM expects the page would
> + * be writeout by flusher. The flusher's writeout is much effective than
> + * reclaimer's random writeout.
> */
> static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> int lru, file;
> - unsigned long vm_flags;
> + int active = 0;

vm_flags is never used in this series.

> - if (!PageLRU(page) || !PageActive(page))
> + if (!PageLRU(page))
> return;
> -
> /* Some processes are using the page */
> if (page_mapped(page))
> return;
> -
> - file = page_is_file_cache(page);
> - lru = page_lru_base_type(page);
> - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> - ClearPageActive(page);
> - ClearPageReferenced(page);
> - add_page_to_lru_list(zone, page, lru);
> - __count_vm_event(PGDEACTIVATE);
> -
> - update_page_reclaim_stat(zone, page, file, 0);
> + if (PageActive(page))
> + active = 1;

active = PageActive(page)

> + if (PageWriteback(page) || PageDirty(page)) {
> + /*
> + * PG_reclaim could be raced with end_page_writeback
> + * It can make readahead confusing. But race window
> + * is _really_ small and it's non-critical problem.
> + */
> + SetPageReclaim(page);
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + active);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (active)
> + __count_vm_event(PGDEACTIVATE);
> + update_page_reclaim_stat(zone, page, file, 0);
> + }

If we lose the race with writeback, the completion handler won't see
PG_reclaim, won't move the page, and we have an unwanted clean cache
page on the active list. Given the pagevec caching of those pages it
could be rather likely that IO completes before the above executes.

Shouldn't this be

if (PageWriteback() || PageDirty()) {
	SetPageReclaim()
	move_to_inactive_head()
} else {
	move_to_inactive_tail()
}

instead?

Hannes
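
To make the suggestion concrete, a minimal sketch of the restructured helper
could look like the following. It assumes the mm/swap.c helpers and the
zone->lru list layout of this kernel and only illustrates the control flow
Hannes describes; it is not the patch that was actually posted.

/*
 * Sketch only: dirty/writeback pages get PG_reclaim and stay at the
 * inactive head so the flusher writes them out; pages that are already
 * clean go straight to the inactive tail so reclaim finds them first.
 */
static void __lru_deactivate(struct page *page, struct zone *zone)
{
	int lru, file, active;

	if (!PageLRU(page))
		return;
	/* Some processes are using the page */
	if (page_mapped(page))
		return;

	active = PageActive(page);
	file = page_is_file_cache(page);
	lru = page_lru_base_type(page);

	del_page_from_lru_list(zone, page, lru + active);
	ClearPageActive(page);
	ClearPageReferenced(page);
	add_page_to_lru_list(zone, page, lru);

	if (PageWriteback(page) || PageDirty(page)) {
		/*
		 * PG_reclaim can race with end_page_writeback; the window
		 * is small and the consequence (a confused readahead
		 * ramp-up) is not critical.
		 */
		SetPageReclaim(page);
	} else {
		/*
		 * The page is already clean: move it to the inactive tail
		 * so it is reclaimed as soon as possible.
		 */
		list_move_tail(&page->lru, &zone->lru[lru].list);
		__count_vm_event(PGROTATED);
	}

	if (active)
		__count_vm_event(PGDEACTIVATE);
	update_page_reclaim_stat(zone, page, file, 0);
}

The else branch covers the case Hannes raises: if writeback completes while
the page sits in the per-cpu pagevec, the page is clean by the time the drain
runs, and the v3 code as posted would have left it untouched on the active
list.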

2010-11-30 11:35:53

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

On Tue, Nov 30, 2010 at 12:23:21AM +0900, Minchan Kim wrote:
> Now zap_pte_range alwayas activates pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
>
> Changelog since v2:
> - remove unnecessary description
> Changelog since v1:
> - change word from promote to activate
> - add activate argument to zap_pte_range and family function
>
> ---
> include/linux/mm.h | 4 ++--
> mm/madvise.c | 4 ++--
> mm/memory.c | 38 +++++++++++++++++++++++---------------
> mm/mmap.c | 4 ++--
> 4 files changed, 29 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e097df6..6032881 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> unsigned long size);
> unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
> - unsigned long size, struct zap_details *);
> + unsigned long size, struct zap_details *, bool activate);

I would prefer naming the parameter 'ignore_references' or something
similar, so that it reflects the immediate effect on the zappers'
behaviour, not what mark_page_accessed() might end up doing.

Other than that, the patch looks good to me.

Hannes

2010-11-30 14:02:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 09:18:22AM +0000, Mel Gorman wrote:
> On Tue, Nov 30, 2010 at 10:10:20AM +0900, KOSAKI Motohiro wrote:
> > > invalidate_mapping_pages is very big hint to reclaimer.
> > > It means user doesn't want to use the page any more.
> > > So in order to prevent working set page eviction, this patch
> > > move the page into tail of inactive list by PG_reclaim.
> > >
> > > Please, remember that pages in inactive list are working set
> > > as well as active list. If we don't move pages into inactive list's
> > > tail, pages near by tail of inactive list can be evicted although
> > > we have a big clue about useless pages. It's totally bad.
> > >
> > > Now PG_readahead/PG_reclaim is shared.
> > > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > > preventing fast reclaiming readahead marker page.
> > >
> > > In this series, PG_reclaim is used by invalidated page, too.
> > > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > > to reclaim asap. Then, when the dirty page will be writeback,
> > > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > > It disturbs this serie's goal.
> > >
> > > I think it's okay to clear PG_readahead when the page is dirty, not
> > > writeback time. So this patch moves ClearPageReadahead.
> > >
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > Acked-by: Rik van Riel <[email protected]>
> > > Cc: Wu Fengguang <[email protected]>
> > > Cc: KOSAKI Motohiro <[email protected]>
> > > Cc: Johannes Weiner <[email protected]>
> > > Cc: Nick Piggin <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> >
> > I still dislike this one. I doubt this trick makes much benefit in real
> > world workload.
> >
>
> I would agree except as said elsewhere, it's a chicken and egg problem.
> We don't have a real world test because fadvise is not useful in its
> current iteration. I'm hoping that there will be a test comparing
>
> rsync on vanilla kernel
> rsync on patched kernel
> rsync+patch on vanilla kernel
> rsync+patch on patched kernel
>
> Are the results of such a test likely to happen?

Ben, could you get the rsync execution time (user/sys) and the
'cat /proc/vmstat' results before/after?
If you are busy, I will try to get the data, but I need enough time.

I expect rsync+patch on the patched kernel to show fewer allocstalls and less
pgscan, and therefore a faster execution time.

>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

--
Kind regards,
Minchan Kim

2010-11-30 14:03:50

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 12:20:41PM +0100, Johannes Weiner wrote:
> On Tue, Nov 30, 2010 at 07:41:30AM +0900, Minchan Kim wrote:
>
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 19e0812..1f1f435 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -275,28 +275,51 @@ void add_page_to_unevictable_list(struct page *page)
> > * into inative list's head. Because the VM expects the page would
> > * be writeout by flusher. The flusher's writeout is much effective
> > * than reclaimer's random writeout.
> > + *
> > + * If the page isn't page_mapped and dirty/writeback, the page
> > + * could reclaim asap using PG_reclaim.
> > + *
> > + * 1. active, mapped page -> none
> > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 3. inactive, mapped page -> none
> > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 5. Others -> none
> > + *
> > + * In 4, why it moves inactive's head, the VM expects the page would
> > + * be writeout by flusher. The flusher's writeout is much effective than
> > + * reclaimer's random writeout.
> > */
> > static void __lru_deactivate(struct page *page, struct zone *zone)
> > {
> > int lru, file;
> > - unsigned long vm_flags;
> > + int active = 0;
>
> vm_flags is never used in this series.

It's a leftover from my old version, which used page_referenced.

>
> > - if (!PageLRU(page) || !PageActive(page))
> > + if (!PageLRU(page))
> > return;
> > -
> > /* Some processes are using the page */
> > if (page_mapped(page))
> > return;
> > -
> > - file = page_is_file_cache(page);
> > - lru = page_lru_base_type(page);
> > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > - ClearPageActive(page);
> > - ClearPageReferenced(page);
> > - add_page_to_lru_list(zone, page, lru);
> > - __count_vm_event(PGDEACTIVATE);
> > -
> > - update_page_reclaim_stat(zone, page, file, 0);
> > + if (PageActive(page))
> > + active = 1;
>
> active = PageActive(page)

Will fix.

>
> > + if (PageWriteback(page) || PageDirty(page)) {
> > + /*
> > + * PG_reclaim could be raced with end_page_writeback
> > + * It can make readahead confusing. But race window
> > + * is _really_ small and it's non-critical problem.
> > + */
> > + SetPageReclaim(page);
> > +
> > + file = page_is_file_cache(page);
> > + lru = page_lru_base_type(page);
> > + del_page_from_lru_list(zone, page, lru + active);
> > + ClearPageActive(page);
> > + ClearPageReferenced(page);
> > + add_page_to_lru_list(zone, page, lru);
> > + if (active)
> > + __count_vm_event(PGDEACTIVATE);
> > + update_page_reclaim_stat(zone, page, file, 0);
> > + }
>
> If we lose the race with writeback, the completion handler won't see
> PG_reclaim, won't move the page, and we have an unwanted clean cache
> page on the active list. Given the pagevec caching of those pages it
> could be rather likely that IO completes before the above executes.
>
> Shouldn't this be
>
> if (PageWriteback() || PageDirty()) {
> 	SetPageReclaim()
> 	move_to_inactive_head()
> } else {
> 	move_to_inactive_tail()
> }
>
> instead?

Fair enough.

Thanks, Hannes.
>
> Hannes

--
Kind regards,
Minchan Kim

2010-11-30 14:04:47

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, Nov 30, 2010 at 09:16:18AM +0000, Mel Gorman wrote:
> On Tue, Nov 30, 2010 at 07:41:30AM +0900, Minchan Kim wrote:
> > On Mon, Nov 29, 2010 at 04:57:06PM +0000, Mel Gorman wrote:
> > > On Tue, Nov 30, 2010 at 12:23:20AM +0900, Minchan Kim wrote:
> > > > invalidate_mapping_pages is very big hint to reclaimer.
> > > > It means user doesn't want to use the page any more.
> > > > So in order to prevent working set page eviction, this patch
> > > > move the page into tail of inactive list by PG_reclaim.
> > > >
> > > > Please, remember that pages in inactive list are working set
> > > > as well as active list. If we don't move pages into inactive list's
> > > > tail, pages near by tail of inactive list can be evicted although
> > > > we have a big clue about useless pages. It's totally bad.
> > > >
> > > > Now PG_readahead/PG_reclaim is shared.
> > > > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > > > preventing fast reclaiming readahead marker page.
> > > >
> > > > In this series, PG_reclaim is used by invalidated page, too.
> > > > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > > > to reclaim asap. Then, when the dirty page will be writeback,
> > > > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > > > It disturbs this serie's goal.
> > > >
> > > > I think it's okay to clear PG_readahead when the page is dirty, not
> > > > writeback time. So this patch moves ClearPageReadahead.
> > > >
> > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > Acked-by: Rik van Riel <[email protected]>
> > > > Cc: Wu Fengguang <[email protected]>
> > > > Cc: KOSAKI Motohiro <[email protected]>
> > > > Cc: Johannes Weiner <[email protected]>
> > > > Cc: Nick Piggin <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > >
> > > > Changelog since v2:
> > > > - put ClearPageReclaim in set_page_dirty - suggested by Wu.
> > > >
> > > > Changelog since v1:
> > > > - make the invalidated page reclaim asap - suggested by Andrew.
> > > > ---
> > > > mm/page-writeback.c | 12 +++++++++++-
> > > > mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
> > > > 2 files changed, 46 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index fc93802..88587a5 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> > > > {
> > > > struct address_space *mapping = page_mapping(page);
> > > >
> > > > + /*
> > > > + * readahead/lru_deactivate_page could remain
> > > > + * PG_readahead/PG_reclaim due to race with end_page_writeback
> > > > + * About readahead, if the page is written, the flags would be
> > > > + * reset. So no problem.
> > > > + * About lru_deactivate_page, if the page is redirty, the flag
> > > > + * will be reset. So no problem. but if the page is used by readahead
> > > > + * it will confuse readahead and make it restart the size rampup
> > > > + * process. But it's a trivial problem.
> > > > + */
> > > > + ClearPageReclaim(page);
> > > > if (likely(mapping)) {
> > > > int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> > > > #ifdef CONFIG_BLOCK
> > > > @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
> > > >
> > > > BUG_ON(!PageLocked(page));
> > > >
> > > > - ClearPageReclaim(page);
> > > > if (mapping && mapping_cap_account_dirty(mapping)) {
> > > > /*
> > > > * Yes, Virginia, this is indeed insane.
> > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > index 19e0812..936b281 100644
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -275,28 +275,50 @@ void add_page_to_unevictable_list(struct page *page)
> > > > * into inative list's head. Because the VM expects the page would
> > > > * be writeout by flusher. The flusher's writeout is much effective
> > > > * than reclaimer's random writeout.
> > > > + *
> > > > + * If the page isn't page_mapped and dirty/writeback, the page
> > > > + * could reclaim asap using PG_reclaim.
> > > > + *
> > > > + * 1. active, mapped page -> none
> > > > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > > > + * 3. inactive, mapped page -> none
> > > > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > > > + * 5. Others -> none
> > > > + *
> > > > + * In 4, why it moves inactive's head, the VM expects the page would
> > > > + * be writeout by flusher. The flusher's writeout is much effective than
> > > > + * reclaimer's random writeout.
> > > > */
> > > > static void __lru_deactivate(struct page *page, struct zone *zone)
> > > > {
> > > > int lru, file;
> > > > - unsigned long vm_flags;
> > > > + int active = 0;
> > > >
> > > > - if (!PageLRU(page) || !PageActive(page))
> > > > + if (!PageLRU(page))
> > > > return;
> > > > -
> > > > /* Some processes are using the page */
> > > > if (page_mapped(page))
> > > > return;
> > > > -
> > > > - file = page_is_file_cache(page);
> > > > - lru = page_lru_base_type(page);
> > > > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > > > - ClearPageActive(page);
> > > > - ClearPageReferenced(page);
> > > > - add_page_to_lru_list(zone, page, lru);
> > > > - __count_vm_event(PGDEACTIVATE);
> > > > -
> > > > - update_page_reclaim_stat(zone, page, file, 0);
> > > > + if (PageActive(page))
> > > > + active = 1;
> > > > +
> > > > + if (PageWriteback(page) || PageDirty(page)) {
> > > > + /*
> > > > + * PG_reclaim could be raced with end_page_writeback
> > > > + * It can make readahead confusing. But race window
> > > > + * is _really_ small and it's non-critical problem.
> > > > + */
> > > > + SetPageReclaim(page);
> > > > +
> > > > + file = page_is_file_cache(page);
> > > > + lru = page_lru_base_type(page);
> > > > + del_page_from_lru_list(zone, page, lru + active);
> > > > + ClearPageActive(page);
> > > > + ClearPageReferenced(page);
> > > > + add_page_to_lru_list(zone, page, lru);
> > > > + __count_vm_event(PGDEACTIVATE);
> > >
> > > You update PGDEACTIVATE whether the page was active or not.
> >
> > My fault.
> > Resend.
> >
> > Subject: [PATCH v3 2/3] Reclaim invalidated page ASAP
> >
> > invalidate_mapping_pages is very big hint to reclaimer.
> > It means user doesn't want to use the page any more.
> > So in order to prevent working set page eviction, this patch
> > move the page into tail of inactive list by PG_reclaim.
> >
> > Please, remember that pages in inactive list are working set
> > as well as active list. If we don't move pages into inactive list's
> > tail, pages near by tail of inactive list can be evicted although
> > we have a big clue about useless pages. It's totally bad.
> >
> > Now PG_readahead/PG_reclaim is shared.
> > fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> > preventing fast reclaiming readahead marker page.
> >
> > In this series, PG_reclaim is used by invalidated page, too.
> > If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> > to reclaim asap. Then, when the dirty page will be writeback,
> > clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> > It disturbs this serie's goal.
> >
> > I think it's okay to clear PG_readahead when the page is dirty, not
> > writeback time. So this patch moves ClearPageReadahead.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > Acked-by: Rik van Riel <[email protected]>
> > Cc: Wu Fengguang <[email protected]>
> > Cc: KOSAKI Motohiro <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Nick Piggin <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> >
> > Changelog since v2:
> > - put ClearPageReclaim in set_page_dirty - suggested by Wu.
> >
> > Changelog since v1:
> > - make the invalidated page reclaim asap - suggested by Andrew.
> > ---
> > mm/page-writeback.c | 12 +++++++++++-
> > mm/swap.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> > 2 files changed, 47 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index fc93802..88587a5 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
> > {
> > struct address_space *mapping = page_mapping(page);
> >
> > + /*
> > + * readahead/lru_deactivate_page could remain
> > + * PG_readahead/PG_reclaim due to race with end_page_writeback
> > + * About readahead, if the page is written, the flags would be
> > + * reset. So no problem.
> > + * About lru_deactivate_page, if the page is redirty, the flag
> > + * will be reset. So no problem. but if the page is used by readahead
> > + * it will confuse readahead and make it restart the size rampup
> > + * process. But it's a trivial problem.
> > + */
> > + ClearPageReclaim(page);
> > if (likely(mapping)) {
> > int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> > #ifdef CONFIG_BLOCK
> > @@ -1307,7 +1318,6 @@ int clear_page_dirty_for_io(struct page *page)
> >
> > BUG_ON(!PageLocked(page));
> >
> > - ClearPageReclaim(page);
> > if (mapping && mapping_cap_account_dirty(mapping)) {
> > /*
> > * Yes, Virginia, this is indeed insane.
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 19e0812..1f1f435 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -275,28 +275,51 @@ void add_page_to_unevictable_list(struct page *page)
> > * into inative list's head. Because the VM expects the page would
> > * be writeout by flusher. The flusher's writeout is much effective
> > * than reclaimer's random writeout.
> > + *
> > + * If the page isn't page_mapped and dirty/writeback, the page
> > + * could reclaim asap using PG_reclaim.
> > + *
> > + * 1. active, mapped page -> none
> > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 3. inactive, mapped page -> none
> > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 5. Others -> none
> > + *
> > + * In 4, why it moves inactive's head, the VM expects the page would
> > + * be writeout by flusher. The flusher's writeout is much effective than
> > + * reclaimer's random writeout.
> > */
> > static void __lru_deactivate(struct page *page, struct zone *zone)
> > {
> > int lru, file;
> > - unsigned long vm_flags;
> > + int active = 0;
> >
> > - if (!PageLRU(page) || !PageActive(page))
> > + if (!PageLRU(page))
> > return;
> > -
> > /* Some processes are using the page */
> > if (page_mapped(page))
> > return;
> > -
> > - file = page_is_file_cache(page);
> > - lru = page_lru_base_type(page);
> > - del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> > - ClearPageActive(page);
> > - ClearPageReferenced(page);
> > - add_page_to_lru_list(zone, page, lru);
> > - __count_vm_event(PGDEACTIVATE);
> > -
> > - update_page_reclaim_stat(zone, page, file, 0);
> > + if (PageActive(page))
> > + active = 1;
> > +
>
> I should have said this the last time but if you do another revision,
> make active a "bool". There is a very slow migration of int to bool in
> cases it makes sense. It's not urgent though.

Okay, I will fix it.

>
> > + if (PageWriteback(page) || PageDirty(page)) {
> > + /*
> > + * PG_reclaim could be raced with end_page_writeback
> > + * It can make readahead confusing. But race window
> > + * is _really_ small and it's non-critical problem.
> > + */
> > + SetPageReclaim(page);
> > +
> > + file = page_is_file_cache(page);
> > + lru = page_lru_base_type(page);
> > + del_page_from_lru_list(zone, page, lru + active);
> > + ClearPageActive(page);
> > + ClearPageReferenced(page);
> > + add_page_to_lru_list(zone, page, lru);
> > + if (active)
> > + __count_vm_event(PGDEACTIVATE);
> > + update_page_reclaim_stat(zone, page, file, 0);
> > + }
> > }
> >
>
> Whether you update active's type or not;
>
> Acked-by: Mel Gorman <[email protected]>

Thanks, Mel.

>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

--
Kind regards,
Minchan Kim

2010-11-30 14:11:07

by Ben Gamari

[permalink] [raw]
Subject: Re: [PATCH 2/3] Reclaim invalidated page ASAP

On Tue, 30 Nov 2010 09:18:22 +0000, Mel Gorman <[email protected]> wrote:
> I would agree except as said elsewhere, it's a chicken and egg problem.
> We don't have a real world test because fadvise is not useful in its
> current iteration. I'm hoping that there will be a test comparing
>
> rsync on vanilla kernel
> rsync on patched kernel
> rsync+patch on vanilla kernel
> rsync+patch on patched kernel
>
> Are the results of such a test likely to happen?
>
Yes, absolutely, although I'm sorry it has taken so long. Between
Thanksgiving and the impending end of the semester things have been a
bit hectic. Nevertheless, I just finished putting together a script to
record some metrics from /proc/vmstat and /proc/[pid]/statm, so at this
point I'm finally ready to take some data. Any suggestions for
particular patterns to look for in the numbers, or other metrics to
record, are welcome. Cheers,

- Ben

2010-11-30 18:35:04

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

On Tue, 30 Nov 2010, Minchan Kim wrote:

> Now zap_pte_range alwayas activates pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
>
> Changelog since v2:
> - remove unnecessary description
> Changelog since v1:
> - change word from promote to activate
> - add activate argument to zap_pte_range and family function
>
> ---
> include/linux/mm.h | 4 ++--
> mm/madvise.c | 4 ++--
> mm/memory.c | 38 +++++++++++++++++++++++---------------
> mm/mmap.c | 4 ++--
> 4 files changed, 29 insertions(+), 21 deletions(-)

Everyone else seems pretty happy with this, and I've not checked
at all whether it achieves your purpose; but personally I'd much
prefer a smaller patch which adds your "activate" or "ignore_references"
flag to struct zap_details, instead of passing this exceptional arg
down lots of levels. That's precisely the purpose of zap_details,
to gather together a few things that aren't needed in the common case
(though I admit the NULL details defaulting may be ugly).

Hugh
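
To make the alternative concrete, a minimal sketch of carrying the hint in
zap_details, using the ignore_references name Hannes suggested, might look
like the following. The field placement and the exact check are illustrative
assumptions, not necessarily what a final patch would look like.

/* include/linux/mm.h: extend zap_details instead of every zap_* signature */
struct zap_details {
	/* ... existing fields (check_mapping, first_index, ...) elided ... */
	bool ignore_references;		/* set by madvise(MADV_DONTNEED) */
};

/* mm/memory.c: the only behavioural change in zap_pte_range() would be */
if (pte_young(ptent) && likely(!VM_SequentialReadHint(vma)) &&
    likely(!(details && details->ignore_references)))
	mark_page_accessed(page);

The wrinkle is the one Hugh concedes: callers normally pass a NULL details,
so madvise_dontneed() would have to construct one just for this flag, and the
existing shortcut that drops details to NULL when no mapping checks are
requested would need a matching tweak.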

2010-12-01 00:49:59

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

Hi Hugh,

On Wed, Dec 1, 2010 at 3:34 AM, Hugh Dickins <[email protected]> wrote:
> On Tue, 30 Nov 2010, Minchan Kim wrote:
>
>> Now zap_pte_range alwayas activates pages which are pte_young &&
>> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
>> it's unnecessary since the page wouldn't use any more.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>>
>> Changelog since v2:
>>  - remove unnecessary description
>> Changelog since v1:
>>  - change word from promote to activate
>>  - add activate argument to zap_pte_range and family function
>>
>> ---
>>  include/linux/mm.h |    4 ++--
>>  mm/madvise.c       |    4 ++--
>>  mm/memory.c        |   38 +++++++++++++++++++++++---------------
>>  mm/mmap.c          |    4 ++--
>>  4 files changed, 29 insertions(+), 21 deletions(-)
>
> Everyone else seems pretty happy with this, and I've not checked
> at all whether it achieves your purpose; but personally I'd much
> prefer a smaller patch which adds your "activate" or "ignore_references"
> flag to struct zap_details, instead of passing this exceptional arg
> down lots of levels.  That's precisely the purpose of zap_details,
> to gather together a few things that aren't needed in the common case
> (though I admit the NULL details defaulting may be ugly).

Before I sent the RFC, I tried that and ran into the NULL-details problem you
mention. But it's worth looking at again.
Since nobody opposes this patch's goal, I will take the time to fold it
into zap_details.

Thanks, Hugh.


> Hugh
>



--
Kind regards,
Minchan Kim

2010-12-01 00:50:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

On Tue, Nov 30, 2010 at 8:35 PM, Johannes Weiner <[email protected]> wrote:
> On Tue, Nov 30, 2010 at 12:23:21AM +0900, Minchan Kim wrote:
>> Now zap_pte_range alwayas activates pages which are pte_young &&
>> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
>> it's unnecessary since the page wouldn't use any more.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>>
>> Changelog since v2:
>>  - remove unnecessary description
>> Changelog since v1:
>>  - change word from promote to activate
>>  - add activate argument to zap_pte_range and family function
>>
>> ---
>>  include/linux/mm.h |    4 ++--
>>  mm/madvise.c       |    4 ++--
>>  mm/memory.c        |   38 +++++++++++++++++++++++---------------
>>  mm/mmap.c          |    4 ++--
>>  4 files changed, 29 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e097df6..6032881 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>  int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>>  		unsigned long size);
>>  unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
>> -		unsigned long size, struct zap_details *);
>> +		unsigned long size, struct zap_details *, bool activate);
>
> I would prefer naming the parameter 'ignore_references' or something
> similar, so that it reflects the immediate effect on the zappers'
> behaviour, not what mark_page_accessed() might end up doing.
>
> Other than that, the patch looks good to me.

Fair enough.
Will fix.
It may take a while until I send the next version, though.

Thanks, Hannes.

>
>        Hannes
>



--
Kind regards,
Minchan Kim