2010-11-28 15:03:20

by Minchan Kim

Subject: [PATCH v2 1/3] deactivate invalidated pages

This patch is based on mmotm-11-23.

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It is triggered by backup workloads (e.g., nightly rsync).
Such a workload creates use-once pages but touches each page
twice, which promotes the pages to the active list and thus
evicts working set pages.

Some application developers want POSIX_FADV_NOREUSE to be
supported, but other OSes don't implement it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As another approach, application developers use POSIX_FADV_DONTNEED,
but it has a problem: if the kernel finds a page under writeback
during invalidate_mapping_pages, it cannot drop the page. That makes
the interface very hard for application programmers to use, because
they always have to sync the data before calling
fadvise(..., POSIX_FADV_DONTNEED) to make sure the pages are
discardable. In the end they can't use the kernel's deferred write,
so they see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
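
As an illustration, the dance applications are forced into today looks
roughly like this; drop_cache_range() is a hypothetical helper and error
handling is omitted:

#include <fcntl.h>
#include <unistd.h>

/* Flush first, because dirty/writeback pages survive DONTNEED; this is
 * exactly where applications are forced to give up deferred write. */
static void drop_cache_range(int fd, off_t offset, off_t len)
{
	fdatasync(fd);
	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}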

In fact, invalidation is a very strong hint to the reclaimer:
it means we don't use the page any more. So let's move pages
that are still being written to the head of the inactive list.

Why the head? A dirty/writeback page will be flushed sooner or
later. This patch uses the PG_reclaim trick so that the page is
moved to the tail of the inactive list when its writeout completes.

This avoids writeout via pageout(), which is less efficient than the
flusher's writeout.
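
For reference, the tail move piggybacks on the existing writeback
completion path; a simplified sketch of this era's mm/filemap.c:

void end_page_writeback(struct page *page)
{
	/* If PG_reclaim was left on the page (e.g. by this series),
	 * rotate it to the tail of the inactive list so reclaim
	 * finds it first. */
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);

	if (!test_clear_page_writeback(page))
		BUG();

	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}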

This patch treats page_mapped(page) pages as part of the working set,
so the page can live at the head of the inactive list and get a chance
to be activated.

Originally I reused Peter's lru_demote with some changes, so I added
his Signed-off-by.

Note:
The PG_reclaim trick on a writeback page can race with
end_page_writeback, so this patch checks PageWriteback once more.
That makes the race window really small; in theory a race remains,
but it's trivial.

Quote from fe3cba17, with some modification:
"If a page gets PG_reclaim unintentionally, it will confuse readahead and
make it restart the size ramp-up process. But it's a trivial problem, and
can mostly be avoided by checking PageWriteback(page) first in readahead."

The PG_reclaim trick doesn't work on dirty pages yet, since
clear_page_dirty_for_io always clears PG_reclaim. The next patch fixes it.

Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>

Changelog since v1:
- modify description
- correct typo
- add some comment
- change deactivation policy
---
mm/swap.c | 84 +++++++++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 31f5ec4..345eca1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
spin_unlock_irq(&zone->lru_lock);
}

-static void __pagevec_lru_deactive(struct pagevec *pvec)
+/*
+ * This function is used by invalidate_mapping_pages.
+ * If the page can't be invalidated, this function moves the page
+ * to the inactive list's head or tail so it is reclaimed ASAP
+ * instead of evicting working set pages.
+ *
+ * PG_reclaim means that when the page's writeback completes, the page
+ * will be moved to the tail of the inactive list to be reclaimed ASAP.
+ *
+ * 1. active, mapped page -> inactive, head
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. others -> none
+ *
+ * In case 4, the page moves to the inactive list's head because the VM
+ * expects it to be written out by the flusher; the flusher's writeout
+ * is much more efficient than the reclaimer's random writeout.
+ */
+static void __lru_deactivate(struct page *page, struct zone *zone)
{
- int i, lru, file;
+ int lru, file;
+ int active = 0;
+
+ if (!PageLRU(page))
+ return;
+
+ if (PageActive(page))
+ active = 1;
+ /* Some processes are using the page */
+ if (page_mapped(page) && !active)
+ return;
+
+ else if (PageWriteback(page)) {
+ SetPageReclaim(page);
+ /* Check race with end_page_writeback */
+ if (!PageWriteback(page))
+ ClearPageReclaim(page);
+ } else if (PageDirty(page))
+ SetPageReclaim(page);
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + active);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (active)
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+}

+/*
+ * This function must be called with preemption disabled.
+ */
+static void __pagevec_lru_deactivate(struct pagevec *pvec)
+{
+ int i;
struct zone *zone = NULL;

for (i = 0; i < pagevec_count(pvec); i++) {
@@ -284,21 +339,7 @@ static void __pagevec_lru_deactive(struct pagevec *pvec)
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
-
- if (PageLRU(page)) {
- if (PageActive(page)) {
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page,
- lru + LRU_ACTIVE);
- ClearPageActive(page);
- ClearPageReferenced(page);
- add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);
-
- update_page_reclaim_stat(zone, page, file, 0);
- }
- }
+ __lru_deactivate(page, zone);
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -336,11 +377,13 @@ static void drain_cpu_pagevecs(int cpu)

pvec = &per_cpu(lru_deactivate_pvecs, cpu);
if (pagevec_count(pvec))
- __pagevec_lru_deactive(pvec);
+ __pagevec_lru_deactivate(pvec);
}

/*
- * Forecfully demote a page to the tail of the inactive list.
+ * Forcefully deactivate a page.
+ * This function is used to reclaim the page ASAP when the page
+ * can't be invalidated because it is dirty or under writeback.
*/
void lru_deactivate_page(struct page *page)
{
@@ -348,12 +391,11 @@ void lru_deactivate_page(struct page *page)
struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);

if (!pagevec_add(pvec, page))
- __pagevec_lru_deactive(pvec);
+ __pagevec_lru_deactivate(pvec);
put_cpu_var(lru_deactivate_pvecs);
}
}

-
void lru_add_drain(void)
{
drain_cpu_pagevecs(get_cpu());
--
1.7.0.4


2010-11-28 15:03:30

by Minchan Kim

Subject: [PATCH v2 2/3] move ClearPageReclaim

Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
prevent fast reclaim of readahead marker pages.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
so the page is reclaimed ASAP. But when the dirty page starts writeback,
clear_page_dirty_for_io clears PG_reclaim unconditionally, which defeats
this series' goal.

I think it's okay to clear PG_readahead (which shares its bit with
PG_reclaim) when the page is dirtied, not at writeback time, so this
patch moves the ClearPageReclaim call.
This patch needs Wu's opinion.
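
For context, PG_readahead and PG_reclaim share a single page flag bit,
which is why the readahead marker and the reclaim hint can clobber each
other; a heavily trimmed sketch of the relevant declarations in
include/linux/page-flags.h of this era:

/* Sketch only: PG_readahead is an alias of PG_reclaim. */
enum pageflags {
	PG_locked,
	PG_reclaim,			/* To be reclaimed asap */
	__NR_PAGEFLAGS,
	/* Aliases */
	PG_readahead = PG_reclaim,	/* Reminder to do async read-ahead */
};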

Signed-off-by: Minchan Kim <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
---
fs/buffer.c | 1 +
mm/page-writeback.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 20a41c6..b920086 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -717,6 +717,7 @@ int __set_page_dirty_buffers(struct page *page)
int newly_dirty;
struct address_space *mapping = page_mapping(page);

+ ClearPageReclaim(page);
if (unlikely(!mapping))
return !TestSetPageDirty(page);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..962b0d8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1146,6 +1146,7 @@ EXPORT_SYMBOL(write_one_page);
*/
int __set_page_dirty_no_writeback(struct page *page)
{
+ ClearPageReclaim(page);
if (!PageDirty(page))
return !TestSetPageDirty(page);
return 0;
@@ -1196,6 +1197,7 @@ EXPORT_SYMBOL(account_page_writeback);
*/
int __set_page_dirty_nobuffers(struct page *page)
{
+ ClearPageReclaim(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
@@ -1258,6 +1260,8 @@ int set_page_dirty(struct page *page)
#endif
return (*spd)(page);
}
+
+ ClearPageReclaim(page);
if (!PageDirty(page)) {
if (!TestSetPageDirty(page))
return 1;
@@ -1280,6 +1284,7 @@ int set_page_dirty_lock(struct page *page)
{
int ret;

+ ClearPageReclaim(page);
lock_page_nosync(page);
ret = set_page_dirty(page);
unlock_page(page);
@@ -1307,7 +1312,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
--
1.7.0.4

2010-11-28 15:03:39

by Minchan Kim

Subject: [PATCH v2 3/3] Prevent promotion of page in madvise_dontneed

Now zap_pte_range always promotes pages that are pte_young &&
!VM_SequentialReadHint(vma). But when it is called via MADV_DONTNEED,
that's unnecessary, since the pages won't be used any more.

If the page is shared by other processes and it's a real working set
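
For context, a minimal sketch of the userspace call this patch targets;
discard_range(), addr and length are hypothetical:

#include <sys/mman.h>

/* After this call the kernel zaps the range; with this patch the zapped
 * pages are no longer activated on their way out. */
static int discard_range(void *addr, size_t length)
{
	return madvise(addr, length, MADV_DONTNEED);
}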

Signed-off-by: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>

Changelog since v1:
- change word from promote to activate
- add activate argument to zap_pte_range and family function
---
include/linux/mm.h | 4 ++--
mm/madvise.c | 4 ++--
mm/memory.c | 38 +++++++++++++++++++++++---------------
mm/mmap.c | 4 ++--
4 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e097df6..6032881 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *);
+ unsigned long size, struct zap_details *, bool activate);
unsigned long unmap_vmas(struct mmu_gather **tlb,
struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *);
+ struct zap_details *, bool activate);

/**
* mm_walk - callbacks for walk_page_range
diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..8bc4b2d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -171,9 +171,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
- zap_page_range(vma, start, end - start, &details);
+ zap_page_range(vma, start, end - start, &details, false);
} else
- zap_page_range(vma, start, end - start, NULL);
+ zap_page_range(vma, start, end - start, NULL, false);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 2c989f3..249e23a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -891,7 +891,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
struct mm_struct *mm = tlb->mm;
pte_t *pte;
@@ -949,7 +950,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
- likely(!VM_SequentialReadHint(vma)))
+ likely(!VM_SequentialReadHint(vma)) &&
+ activate)
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
@@ -989,7 +991,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pmd_t *pmd;
unsigned long next;
@@ -1002,7 +1005,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
continue;
}
next = zap_pte_range(tlb, vma, pmd, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pmd++, addr = next, (addr != end && *zap_work > 0));

return addr;
@@ -1011,7 +1014,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pud_t *pud;
unsigned long next;
@@ -1024,7 +1028,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
continue;
}
next = zap_pmd_range(tlb, vma, pud, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pud++, addr = next, (addr != end && *zap_work > 0));

return addr;
@@ -1033,7 +1037,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
static unsigned long unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
- long *zap_work, struct zap_details *details)
+ long *zap_work, struct zap_details *details,
+ bool activate)
{
pgd_t *pgd;
unsigned long next;
@@ -1052,7 +1057,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
continue;
}
next = zap_pud_range(tlb, vma, pgd, addr, next,
- zap_work, details);
+ zap_work, details, activate);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
mem_cgroup_uncharge_end();
@@ -1075,6 +1080,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
* @end_addr: virtual address at which to end unmapping
* @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
* @details: details of nonlinear truncation or shared cache invalidation
+ * @activate: whether pages included in the vma should be activated or not
*
* Returns the end address of the unmapping (restart addr if interrupted).
*
@@ -1096,7 +1102,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
unsigned long unmap_vmas(struct mmu_gather **tlbp,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *details)
+ struct zap_details *details, bool activate)
{
long zap_work = ZAP_BLOCK_SIZE;
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
@@ -1149,8 +1155,8 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,

start = end;
} else
- start = unmap_page_range(*tlbp, vma,
- start, end, &zap_work, details);
+ start = unmap_page_range(*tlbp, vma, start,
+ end, &zap_work, details, activate);

if (zap_work > 0) {
BUG_ON(start != end);
@@ -1184,9 +1190,10 @@ out:
* @address: starting address of pages to zap
* @size: number of bytes to zap
* @details: details of nonlinear truncation or shared cache invalidation
+ * @activate: whether pages included in the vma should be activated or not
*/
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *details)
+ unsigned long size, struct zap_details *details, bool activate)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather *tlb;
@@ -1196,7 +1203,8 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+ end = unmap_vmas(&tlb, vma, address, end, &nr_accounted,
+ details, activate);
if (tlb)
tlb_finish_mmu(tlb, address, end);
return end;
@@ -1220,7 +1228,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
if (address < vma->vm_start || address + size > vma->vm_end ||
!(vma->vm_flags & VM_PFNMAP))
return -1;
- zap_page_range(vma, address, size, NULL);
+ zap_page_range(vma, address, size, NULL, false);
return 0;
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);
@@ -2481,7 +2489,7 @@ again:
}

restart_addr = zap_page_range(vma, start_addr,
- end_addr - start_addr, details);
+ end_addr - start_addr, details, true);
need_break = need_resched() || spin_needbreak(details->i_mmap_lock);

if (restart_addr >= end_addr) {
diff --git a/mm/mmap.c b/mm/mmap.c
index b179abb..0ed5ab3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1904,7 +1904,7 @@ static void unmap_region(struct mm_struct *mm,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL, true);
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
@@ -2278,7 +2278,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL, true);
vm_unacct_memory(nr_accounted);

free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
--
1.7.0.4

2010-11-28 15:13:27

by Minchan Kim

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

This patch is an add-on patch for mmotm-11-23.
Please read the original patch and thread.

http://marc.info/?l=linux-kernel&m=129034986927826&w=3

--
Kind regards,
Minchan Kim

2010-11-28 19:02:59

by Rik van Riel

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

On 11/28/2010 10:02 AM, Minchan Kim wrote:
> This patch is based on mmotm-11-23.
>
> Recently, a thrashing problem was reported.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It is triggered by backup workloads (e.g., nightly rsync).
> Such a workload creates use-once pages but touches each page
> twice, which promotes the pages to the active list and thus
> evicts working set pages.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-11-29 00:33:46

by KOSAKI Motohiro

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

> ---
> mm/swap.c | 84 +++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 files changed, 63 insertions(+), 21 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 31f5ec4..345eca1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
> spin_unlock_irq(&zone->lru_lock);
> }
>
> -static void __pagevec_lru_deactive(struct pagevec *pvec)
> +/*
> + * This function is used by invalidate_mapping_pages.
> + * If the page can't be invalidated, this function moves the page
> + * to the inactive list's head or tail so it is reclaimed ASAP
> + * instead of evicting working set pages.
> + *
> + * PG_reclaim means that when the page's writeback completes, the page
> + * will be moved to the tail of the inactive list to be reclaimed ASAP.
> + *
> + * 1. active, mapped page -> inactive, head
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. others -> none
> + *
> + * In case 4, the page moves to the inactive list's head because the VM
> + * expects it to be written out by the flusher; the flusher's writeout
> + * is much more efficient than the reclaimer's random writeout.
> + */
> +static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> - int i, lru, file;
> + int lru, file;
> + int active = 0;
> +
> + if (!PageLRU(page))
> + return;
> +
> + if (PageActive(page))
> + active = 1;
> + /* Some processes are using the page */
> + if (page_mapped(page) && !active)
> + return;
> +
> + else if (PageWriteback(page)) {
> + SetPageReclaim(page);
> + /* Check race with end_page_writeback */
> + if (!PageWriteback(page))
> + ClearPageReclaim(page);
> + } else if (PageDirty(page))
> + SetPageReclaim(page);
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + active);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (active)
> + __count_vm_event(PGDEACTIVATE);
> +
> + update_page_reclaim_stat(zone, page, file, 0);
> +}

I don't like this change because fadvise(DONT_NEED) is a rarely used
function and this PG_reclaim trick doesn't improve things much. On the
other hand, it increases the VM state mess.

However, I haven't found any fault or reason this patch wouldn't work.

2010-11-29 01:58:50

by Ben Gamari

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

On Mon, 29 Nov 2010 09:33:38 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
>
> > I don't like this change because fadvise(DONT_NEED) is a rarely used
> > function and this PG_reclaim trick doesn't improve things much. On the
> > other hand, it increases the VM state mess.
>

Can we please stop appealing to this argument? The reason that
fadvise(DONT_NEED) is currently rarely employed is that the interface as
implemented now is extremely kludgey to use.

Are you proposing that this particular implementation is not worth the
mess (as opposed to putting the pages at the head of the inactive list
as done earlier) or would you rather that we simply leave DONT_NEED in
its current state? Even if today's gains aren't as great as we would
like them to be, we should still make an effort to make fadvise()
usable, if for no other reason than to encourage use in user-space so
that applications can benefit when we finally do figure out how to
properly account for the user's hints.

Cheers,

- Ben

2010-11-29 02:13:24

by Minchan Kim

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

Hi KOSAKI,

On Mon, Nov 29, 2010 at 9:33 AM, KOSAKI Motohiro
<[email protected]> wrote:
>
> I don't like this change because fadvise(DONT_NEED) is a rarely used
> function and this PG_reclaim trick doesn't improve things much. On the
> other hand, it increases the VM state mess.

Chicken-and-egg problem.
The reason fadvise(DONT_NEED) is rarely used is that it's hard to use
effectively. The mincore + fdatasync + fadvise sequence is very ugly.
This patch's goal is to solve that.

The PG_reclaim trick prevents working set eviction.
If you call fadvise and the invalidated pages sit dirty in the middle
of the inactive LRU, the reclaimer would evict working set pages from
the tail of the inactive LRU even though there are invalidated pages
in the LRU.
That's bad.

As for the VM state mess, PG_readahead has already done that.
But I admit this patch could make it worse, and that's why I Cc'ed Wu
Fengguang.

The problems it can cause are readahead confusion and working set
eviction after writeback.
I could add a ClearPageReclaim in mark_page_accessed to clear the flag
if the page is accessed during the race, but I didn't add it in this
version because I think it's a very rare case.
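
A sketch of that hypothetical variant, based on this era's
mark_page_accessed() in mm/swap.c; only the ClearPageReclaim line
would be new:

void mark_page_accessed(struct page *page)
{
	/* Hypothetical: the page is in use again, drop the reclaim hint */
	ClearPageReclaim(page);

	if (!PageActive(page) && !PageUnevictable(page) &&
			PageReferenced(page) && PageLRU(page)) {
		activate_page(page);
		ClearPageReferenced(page);
	} else if (!PageReferenced(page)) {
		SetPageReferenced(page);
	}
}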

I don't want to add a new page flag just for this function, or to
revert the patch that merged PG_readahead and PG_reclaim.


>
> However, I haven't found any fault or reason this patch wouldn't work.
>
Thanks for the good review, KOSAKI. :)


--
Kind regards,
Minchan Kim

2010-11-29 02:13:58

by KOSAKI Motohiro

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

> > I don't like this change because fadvise(DONT_NEED) is a rarely used
> > function and this PG_reclaim trick doesn't improve things much. On the
> > other hand, it increases the VM state mess.
> >
>
> Can we please stop appealing to this argument? The reason that
> fadvise(DONT_NEED) is currently rarely employed is that the interface as
> implemented now is extremely kludgey to use.
>
> Are you proposing that this particular implementation is not worth the
> mess (as opposed to putting the pages at the head of the inactive list
> as done earlier) or would you rather that we simply leave DONT_NEED in
> its current state? Even if today's gains aren't as great as we would
> like them to be, we should still make an effort to make fadvise()
> usable, if for no other reason than to encourage use in user-space so
> that applications can benefit when we finally do figure out how to
> properly account for the user's hints.

Hi

I'm not against the DONT_NEED feature. I only said the PG_reclaim
trick is not so effective. Every feature has its own pros and cons,
and I think the cons are too big here. Also, nobody has measured the
PG_reclaim performance gain. Did you?


2010-11-29 02:26:22

by Ben Gamari

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

On Mon, 29 Nov 2010 11:13:53 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> I'm not against the DONT_NEED feature. I only said the PG_reclaim
> trick is not so effective. Every feature has its own pros and cons,
> and I think the cons are too big here. Also, nobody has measured the
> PG_reclaim performance gain. Did you?
>
Not yet. Finally back from vacation here in the States. I'll try sitting
down to put together a test tonight.

- Ben

2010-11-29 02:36:04

by KOSAKI Motohiro

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

Hi

> > I don't like this change because fadvise(DONT_NEED) is a rarely used
> > function and this PG_reclaim trick doesn't improve things much. On the
> > other hand, it increases the VM state mess.
>
> Chick-egg problem.
> Why fadvise(DONT_NEED) is rarely used is it's hard to use effective.
> mincore + fdatasync + fadvise series is very ugly.
> This patch's goal is to solve it.

Well, I didn't oppose your previous patch for this reason.
I think everyone agrees the mincore + fdatasync + fadvise mess is ugly.

However, I doubt the PG_reclaim trick is that effective. I mean, _if_ it's
effective, then our current streaming-IO heuristics don't work very well,
which is bad, and if so we should fix them generically. That's the reason
I prefer the simple add_page_to_lru_list().

Please remember why we made this in the first place: rsync has a special
IO access pattern, so our streaming-IO detection doesn't work well for it;
therefore we decided to improve the manual knob. But why do we need
completely different behavior between the manual DONT_NEED hint and
automatic DONT_NEED detection?

That's my point.




2010-11-29 04:25:56

by Rik van Riel

Subject: Re: [PATCH v2 2/3] move ClearPageReclaim

On 11/28/2010 10:02 AM, Minchan Kim wrote:
> Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
> prevent fast reclaim of readahead marker pages.
>
> In this series, PG_reclaim is used for invalidated pages, too.
> If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
> so the page is reclaimed ASAP. But when the dirty page starts writeback,
> clear_page_dirty_for_io clears PG_reclaim unconditionally, which defeats
> this series' goal.
>
> I think it's okay to clear PG_readahead (which shares its bit with
> PG_reclaim) when the page is dirtied, not at writeback time, so this
> patch moves the ClearPageReclaim call.
> This patch needs Wu's opinion.

Acked-by: Rik van Riel <[email protected]>


--
All rights reversed

2010-11-29 04:28:26

by Rik van Riel

Subject: Re: [PATCH v2 3/3] Prevent promotion of page in madvise_dontneed

On 11/28/2010 10:02 AM, Minchan Kim wrote:
> Now zap_pte_range always promotes pages that are pte_young &&
> !VM_SequentialReadHint(vma). But when it is called via MADV_DONTNEED,
> that's unnecessary, since the pages won't be used any more.
>
> If the page is shared by other processes and it's a real working set

This line seems to be superfluous; I don't see any special
treatment for this case in the code.

> Signed-off-by: Minchan Kim <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>

The patch itself looks good to me.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-11-29 04:30:30

by Minchan Kim

Subject: Re: [PATCH v2 3/3] Prevent promotion of page in madvise_dontneed

On Mon, Nov 29, 2010 at 1:28 PM, Rik van Riel <[email protected]> wrote:
> On 11/28/2010 10:02 AM, Minchan Kim wrote:
>>
>> Now zap_pte_range always promotes pages that are pte_young &&
>> !VM_SequentialReadHint(vma). But when it is called via MADV_DONTNEED,
>> that's unnecessary, since the pages won't be used any more.
>>
>> If the page is shared by other processes and it's a real working set
>
> This line seems to be superfluous; I don't see any special
> treatment for this case in the code.

I should remove the lines.
It's my fault.

>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>
> The patch itself looks good to me.
>
> Acked-by: Rik van Riel <[email protected]>

Thanks, Rik.
--
Kind regards,
Minchan Kim

2010-11-29 07:29:56

by Fengguang Wu

Subject: Re: [PATCH v2 2/3] move ClearPageReclaim

On Sun, Nov 28, 2010 at 11:02:56PM +0800, Minchan Kim wrote:
> Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
> prevent fast reclaim of readahead marker pages.
>
> In this series, PG_reclaim is used for invalidated pages, too.
> If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
> so the page is reclaimed ASAP. But when the dirty page starts writeback,
> clear_page_dirty_for_io clears PG_reclaim unconditionally, which defeats
> this series' goal.
>
> I think it's okay to clear PG_readahead (which shares its bit with
> PG_reclaim) when the page is dirtied, not at writeback time, so this
> patch moves the ClearPageReclaim call.
> This patch needs Wu's opinion.

It's a safe change. The possibility and consequence of a race are
both small enough. However, couldn't the patch be simplified as
follows?

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c 2010-11-29 15:14:54.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-11-29 15:15:02.000000000 +0800
@@ -1330,6 +1330,7 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ ClearPageReclaim(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
@@ -1387,7 +1388,6 @@ int clear_page_dirty_for_io(struct page

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.

2010-11-29 07:50:08

by Fengguang Wu

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

On Sun, Nov 28, 2010 at 11:02:55PM +0800, Minchan Kim wrote:
> This patch is based on mmotm-11-23.

I cannot find __pagevec_lru_deactive() in mmotm-11-23.
Do you have any more patches?

> diff --git a/mm/swap.c b/mm/swap.c
> index 31f5ec4..345eca1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
> spin_unlock_irq(&zone->lru_lock);
> }
>
> -static void __pagevec_lru_deactive(struct pagevec *pvec)
> +/*
> + * This function is used by invalidate_mapping_pages.
> + * If the page can't be invalidated, this function moves the page
> + * to the inactive list's head or tail so it is reclaimed ASAP
> + * instead of evicting working set pages.
> + *
> + * PG_reclaim means that when the page's writeback completes, the page
> + * will be moved to the tail of the inactive list to be reclaimed ASAP.
> + *
> + * 1. active, mapped page -> inactive, head
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. others -> none
> + *
> + * In case 4, the page moves to the inactive list's head because the VM
> + * expects it to be written out by the flusher; the flusher's writeout
> + * is much more efficient than the reclaimer's random writeout.
> + */
> +static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> - int i, lru, file;
> + int lru, file;
> + int active = 0;
> +
> + if (!PageLRU(page))
> + return;
> +
> + if (PageActive(page))
> + active = 1;
> + /* Some processes are using the page */
> + if (page_mapped(page) && !active)
> + return;

It's good to check such protections when doing heuristic demotion.
However, if it is requested explicitly by the user, I'm _much more_
inclined to act stupid and dumb and meet the user's expectation. Or
will this code be called by someone other than DONTNEED? Sorry, I
have no context of the full code.

> + else if (PageWriteback(page)) {
> + SetPageReclaim(page);
> + /* Check race with end_page_writeback */
> + if (!PageWriteback(page))
> + ClearPageReclaim(page);

Does the double check help a lot?

> + } else if (PageDirty(page))
> + SetPageReclaim(page);

Typically there are many more dirty pages than writeback pages.
I guess this is a good place to call bdi_start_inode_writeback(),
which was posted here: http://www.spinics.net/lists/linux-mm/msg10833.html

Thanks,
Fengguang


2010-11-29 08:09:23

by Minchan Kim

Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

Hi Wu,

On Mon, Nov 29, 2010 at 4:49 PM, Wu Fengguang <[email protected]> wrote:
> On Sun, Nov 28, 2010 at 11:02:55PM +0800, Minchan Kim wrote:
>> This patch is based on mmotm-11-23.
>
> I cannot find __pagevec_lru_deactive() in mmotm-11-23.
> Do you have any more patches?

Please see this patch.
http://www.spinics.net/lists/mm-commits/msg80851.html

>
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 31f5ec4..345eca1 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
>> spin_unlock_irq(&zone->lru_lock);
>> }
>>
>> -static void __pagevec_lru_deactive(struct pagevec *pvec)
>> +/*
>> + * This function is used by invalidate_mapping_pages.
>> + * If the page can't be invalidated, this function moves the page
>> + * to the inactive list's head or tail so it is reclaimed ASAP
>> + * instead of evicting working set pages.
>> + *
>> + * PG_reclaim means that when the page's writeback completes, the page
>> + * will be moved to the tail of the inactive list to be reclaimed ASAP.
>> + *
>> + * 1. active, mapped page -> inactive, head
>> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
>> + * 3. inactive, mapped page -> none
>> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
>> + * 5. others -> none
>> + *
>> + * In case 4, the page moves to the inactive list's head because the VM
>> + * expects it to be written out by the flusher; the flusher's writeout
>> + * is much more efficient than the reclaimer's random writeout.
>> + */
>> +static void __lru_deactivate(struct page *page, struct zone *zone)
>> {
>> - int i, lru, file;
>> + int lru, file;
>> + int active = 0;
>> +
>> + if (!PageLRU(page))
>> + return;
>> +
>> + if (PageActive(page))
>> + active = 1;
>> + /* Some processes are using the page */
>> + if (page_mapped(page) && !active)
>> + return;
>
> It's good to check such protections when doing heuristic demotion.
> However, if it is requested explicitly by the user, I'm _much more_
> inclined to act stupid and dumb and meet the user's expectation. Or
> will this code be called by someone other than DONTNEED? Sorry, I
> have no context of the full code.

Sorry.

Yes, I expect lru_deactivate_page can be used in other places with
some modification.
The first place I have in mind is here:

http://www.mail-archive.com/[email protected]/msg179576.html
After I make sure this patch is effective, I will try that, too.


>
>> + else if (PageWriteback(page)) {
>> + SetPageReclaim(page);
>> + /* Check race with end_page_writeback */
>> + if (!PageWriteback(page))
>> + ClearPageReclaim(page);
>
> Does the double check help a lot?
>
>> + } else if (PageDirty(page))
>> + SetPageReclaim(page);
>
> Typically there are many more dirty pages than writeback pages.
> I guess this is a good place to call bdi_start_inode_writeback(),
> which was posted here: http://www.spinics.net/lists/linux-mm/msg10833.html

It looks good to me.
It makes my code very simple.

I can use it, which means my patch would depend on your patch.
Do you have a plan to merge it?





--
Kind regards,
Minchan Kim

2010-11-29 08:16:04

by Minchan Kim

Subject: Re: [PATCH v2 2/3] move ClearPageReclaim

On Mon, Nov 29, 2010 at 4:29 PM, Wu Fengguang <[email protected]> wrote:
> On Sun, Nov 28, 2010 at 11:02:56PM +0800, Minchan Kim wrote:
>> Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
>> prevent fast reclaim of readahead marker pages.
>>
>> In this series, PG_reclaim is used for invalidated pages, too.
>> If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
>> so the page is reclaimed ASAP. But when the dirty page starts writeback,
>> clear_page_dirty_for_io clears PG_reclaim unconditionally, which defeats
>> this series' goal.
>>
>> I think it's okay to clear PG_readahead (which shares its bit with
>> PG_reclaim) when the page is dirtied, not at writeback time, so this
>> patch moves the ClearPageReclaim call.
>> This patch needs Wu's opinion.
>
> It's a safe change. The possibility and consequence of a race are
> both small enough. However, couldn't the patch be simplified as
> follows?

If all filesystems use it, I don't mind.
Do all filesystems go through it when a page is dirtied?
I wasn't sure of that (which is why I added the Cc :).
If it doesn't cause a problem, I hope we do it that way.

Thanks, Wu.




--
Kind regards,
Minchan Kim

2010-11-29 08:26:57

by Fengguang Wu

Subject: Re: [PATCH v2 2/3] move ClearPageReclaim

On Mon, Nov 29, 2010 at 04:16:01PM +0800, Minchan Kim wrote:
> On Mon, Nov 29, 2010 at 4:29 PM, Wu Fengguang <[email protected]> wrote:
> >
> > It's a safe change. The possibility and consequence of a race are
> > both small enough. However, couldn't the patch be simplified as
> > follows?
>
> If all filesystems use it, I don't mind.
> Do all filesystems go through it when a page is dirtied?
> I wasn't sure of that (which is why I added the Cc :).
> If it doesn't cause a problem, I hope we do it that way.

Please double-check, but here are my findings:

__set_page_dirty_buffers() is called by several fs' ->set_page_dirty()
which are all called by set_page_dirty().

set_page_dirty_lock() will call set_page_dirty().

__set_page_dirty_no_writeback(): it has no connection to
end_page_writeback(), so no need to set PG_reclaim.
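
For reference, the point above is that the filesystem ->set_page_dirty()
handlers, including __set_page_dirty_buffers(), are reached through
set_page_dirty(); a simplified sketch of this era's dispatch in
mm/page-writeback.c:

int set_page_dirty(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (likely(mapping)) {
		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
		if (!spd)
			spd = __set_page_dirty_buffers;	/* block-based default */
		return (*spd)(page);
	}
	if (!PageDirty(page)) {
		if (!TestSetPageDirty(page))
			return 1;
	}
	return 0;
}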

Thanks,
Fengguang



2010-11-29 12:07:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

On Mon, Nov 29, 2010 at 12:02:55AM +0900, Minchan Kim wrote:
> This patch is based on mmotm-11-23.
>
> Recently, there have been reports of thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens with backup workloads (e.g., a nightly rsync).
> Such a workload creates use-once pages but touches them twice,
> which promotes them to the active list and results in
> working set pages being evicted.
>
> Some application developers want POSIX_FADV_NOREUSE support,
> but other OSes don't support it either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> As another approach, application developers use POSIX_FADV_DONTNEED,
> but it has a problem: if the kernel finds a page under writeback
> during invalidate_mapping_pages, it can't invalidate it.
> That makes it very hard for application programmers to use, because
> they always have to sync the data before calling
> fadvise(..., POSIX_FADV_DONTNEED) to make sure the pages are
> discardable. As a result, they can't benefit from the kernel's
> deferred writes, so they may see a performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is a very strong hint to the reclaimer:
> it means we don't use the page any more. So let's move a page that
> is still being written to the inactive list's head.
>
> Why the head? A dirty/writeback page will be flushed sooner or
> later anyway. This patch uses the PG_reclaim trick, so the page is
> moved to the tail of the inactive list when its writeout completes.
>
> This avoids pageout's writeout, which is less effective than the
> flusher's writeout.
>
> This patch treats page_mapped(page) as part of the working set, so
> such a page is left at the head of the inactive list and gets a
> chance to be activated again.
>
> Originally, I reused Peter's lru_demote with some changes, so I
> added his Signed-off-by.
>
> Note:
> The PG_reclaim trick for writeback pages can race with
> end_page_writeback, so this patch checks PageWriteback once more.
> That makes the race window really small; theoretically a race
> remains, but it's trivial.
>
> Quote from fe3cba17, with some modification:
> "If some page is set PG_reclaim unintentionally, it will confuse
> readahead and make it restart the size rampup process. But it's a
> trivial problem, and can mostly be avoided by checking
> PageWriteback(page) first in readahead."
>
> The PG_reclaim trick for dirty pages doesn't work yet, since
> clear_page_dirty_for_io always clears PG_reclaim. The next patch
> will fix that.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
>
> Changelog since v1:
> - modify description
> - correct typo
> - add some comment
> - change deactivation policy
> ---
> mm/swap.c | 84 +++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 files changed, 63 insertions(+), 21 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 31f5ec4..345eca1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
> spin_unlock_irq(&zone->lru_lock);
> }
>
> -static void __pagevec_lru_deactive(struct pagevec *pvec)
> +/*
> + * This function is used by invalidate_mapping_pages.
> + * If the page can't be invalidated, this function moves the page
> + * into inative list's head or tail to reclaim ASAP and evict
> + * working set page.
> + *
> + * PG_reclaim means when the page's writeback completes, the page
> + * will move into tail of inactive for reclaiming ASAP.
> + *
> + * 1. active, mapped page -> inactive, head
> + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> + * 3. inactive, mapped page -> none
> + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> + * 5. others -> none
> + *
> + * In 4, why it moves inactive's head, the VM expects the page would
> + * be writeout by flusher. The flusher's writeout is much effective than
> + * reclaimer's random writeout.
> + */
> +static void __lru_deactivate(struct page *page, struct zone *zone)
> {
> - int i, lru, file;
> + int lru, file;
> + int active = 0;
> +
> + if (!PageLRU(page))
> + return;
> +
> + if (PageActive(page))
> + active = 1;
> + /* Some processes are using the page */
> + if (page_mapped(page) && !active)
> + return;
> +

Why do we move active pages to the inactive list? I thought the decision
was that mapped pages are certainly in use, so they should not be affected
by fadvise(). In contrast, I see you leave inactive mapped pages alone.
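
In other words, I would have expected something like this (hypothetical,
just to make the question concrete):

	/* Skip every mapped page, whether active or not. */
	if (page_mapped(page))
		return;

whereas the posted code only skips mapped pages that are already inactive.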

> + else if (PageWriteback(page)) {
> + SetPageReclaim(page);
> + /* Check race with end_page_writeback */
> + if (!PageWriteback(page))
> + ClearPageReclaim(page);

I think this is safe but the comment could be expanded to mention that
the page is locked at this point and explain how it's impossible for
PageReclaim to be set on a !PageWriteback page here.
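
Something along these lines, perhaps (only a sketch of the expanded
comment, not suggested wording for the final patch):

	} else if (PageWriteback(page)) {
		/*
		 * The caller holds the page lock, so end_page_writeback()
		 * is the only thing that can clear PG_writeback under us.
		 * Set PG_reclaim first; if writeback completed in the
		 * meantime, the recheck below clears it again, so a
		 * PG_reclaim page seen here is (almost) always still
		 * under writeback.
		 */
		SetPageReclaim(page);
		if (!PageWriteback(page))
			ClearPageReclaim(page);
	}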

> + } else if (PageDirty(page))
> + SetPageReclaim(page);
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + active);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (active)
> + __count_vm_event(PGDEACTIVATE);
> +
> + update_page_reclaim_stat(zone, page, file, 0);
> +}
>
> +/*
> + * This function must be called with preemption disable.
> + */
> +static void __pagevec_lru_deactivate(struct pagevec *pvec)
> +{
> + int i;
> struct zone *zone = NULL;
>
> for (i = 0; i < pagevec_count(pvec); i++) {
> @@ -284,21 +339,7 @@ static void __pagevec_lru_deactive(struct pagevec *pvec)
> zone = pagezone;
> spin_lock_irq(&zone->lru_lock);
> }
> -
> - if (PageLRU(page)) {
> - if (PageActive(page)) {
> - file = page_is_file_cache(page);
> - lru = page_lru_base_type(page);
> - del_page_from_lru_list(zone, page,
> - lru + LRU_ACTIVE);
> - ClearPageActive(page);
> - ClearPageReferenced(page);
> - add_page_to_lru_list(zone, page, lru);
> - __count_vm_event(PGDEACTIVATE);
> -
> - update_page_reclaim_stat(zone, page, file, 0);
> - }
> - }
> + __lru_deactivate(page, zone);
> }
> if (zone)
> spin_unlock_irq(&zone->lru_lock);
> @@ -336,11 +377,13 @@ static void drain_cpu_pagevecs(int cpu)
>
> pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> if (pagevec_count(pvec))
> - __pagevec_lru_deactive(pvec);
> + __pagevec_lru_deactivate(pvec);
> }
>
> /*
> - * Forecfully demote a page to the tail of the inactive list.
> + * Forcefully deactivate a page.
> + * This function is used for reclaiming the page ASAP when the page
> + * can't be invalidated by Dirty/Writeback.
> */
> void lru_deactivate_page(struct page *page)
> {
> @@ -348,12 +391,11 @@ void lru_deactivate_page(struct page *page)
> struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
>
> if (!pagevec_add(pvec, page))
> - __pagevec_lru_deactive(pvec);
> + __pagevec_lru_deactivate(pvec);
> put_cpu_var(lru_deactivate_pvecs);
> }
> }
>
> -

Unnecessary whitespace change there.

> void lru_add_drain(void)
> {
> drain_cpu_pagevecs(get_cpu());

Functionally, I think this will work (although I'd like a clarification
on why active pages are rotated). It'd be nice if there was a test case
for this but it's a bit of a chicken-and-egg problem :/
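
A minimal test sketch for the record (hypothetical and untested, not
from this thread): dirty a file, call fadvise() without fsync(), and
compare the pgdeactivate counter in /proc/vmstat before and after.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int i, fd;

		fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(buf, 'x', sizeof(buf));
		for (i = 0; i < 1024; i++)	/* 4MB of use-once data */
			if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
				perror("write");
				return 1;
			}

		/*
		 * Deliberately no fsync(): the pages are still dirty,
		 * the exact case invalidate_mapping_pages() cannot
		 * discard and this series deactivates instead.
		 */
		if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
			fprintf(stderr, "posix_fadvise failed\n");

		close(fd);
		return 0;
	}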

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-29 15:29:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] deactivate invalidated pages

Hi Mel,

On Mon, Nov 29, 2010 at 12:07:16PM +0000, Mel Gorman wrote:
> On Mon, Nov 29, 2010 at 12:02:55AM +0900, Minchan Kim wrote:
> > [patch description and changelog snipped; quoted in full above]
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 31f5ec4..345eca1 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -268,10 +268,65 @@ void add_page_to_unevictable_list(struct page *page)
> > spin_unlock_irq(&zone->lru_lock);
> > }
> >
> > -static void __pagevec_lru_deactive(struct pagevec *pvec)
> > +/*
> > + * This function is used by invalidate_mapping_pages.
> > + * If the page can't be invalidated, this function moves the page
> > + * into inative list's head or tail to reclaim ASAP and evict
> > + * working set page.
> > + *
> > + * PG_reclaim means when the page's writeback completes, the page
> > + * will move into tail of inactive for reclaiming ASAP.
> > + *
> > + * 1. active, mapped page -> inactive, head
> > + * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 3. inactive, mapped page -> none
> > + * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
> > + * 5. others -> none
> > + *
> > + * In 4, why it moves inactive's head, the VM expects the page would
> > + * be writeout by flusher. The flusher's writeout is much effective than
> > + * reclaimer's random writeout.
> > + */
> > +static void __lru_deactivate(struct page *page, struct zone *zone)
> > {
> > - int i, lru, file;
> > + int lru, file;
> > + int active = 0;
> > +
> > + if (!PageLRU(page))
> > + return;
> > +
> > + if (PageActive(page))
> > + active = 1;
> > + /* Some processes are using the page */
> > + if (page_mapped(page) && !active)
> > + return;
> > +
>
> Why do we move active pages to the inactive list? I thought the decision
> was that mapped pages are certainly in use, so they should not be affected
> by fadvise(). In contrast, I see you leave inactive mapped pages alone.

I'll fix it and resend a new version.
Please take a look at that.

>
> > + else if (PageWriteback(page)) {
> > + SetPageReclaim(page);
> > + /* Check race with end_page_writeback */
> > + if (!PageWriteback(page))
> > + ClearPageReclaim(page);
>
> I think this is safe but the comment could be expanded to mention that
> the page is locked at this point and explain how it's impossible for
> PageReclaim to be set on a !PageWriteback page here.

Ditto.

>
> > + } else if (PageDirty(page))
> > + SetPageReclaim(page);
> > +
> > + file = page_is_file_cache(page);
> > + lru = page_lru_base_type(page);
> > + del_page_from_lru_list(zone, page, lru + active);
> > + ClearPageActive(page);
> > + ClearPageReferenced(page);
> > + add_page_to_lru_list(zone, page, lru);
> > + if (active)
> > + __count_vm_event(PGDEACTIVATE);
> > +
> > + update_page_reclaim_stat(zone, page, file, 0);
> > +}
> >
> > +/*
> > + * This function must be called with preemption disable.
> > + */
> > +static void __pagevec_lru_deactivate(struct pagevec *pvec)
> > +{
> > + int i;
> > struct zone *zone = NULL;
> >
> > for (i = 0; i < pagevec_count(pvec); i++) {
> > @@ -284,21 +339,7 @@ static void __pagevec_lru_deactive(struct pagevec *pvec)
> > zone = pagezone;
> > spin_lock_irq(&zone->lru_lock);
> > }
> > -
> > - if (PageLRU(page)) {
> > - if (PageActive(page)) {
> > - file = page_is_file_cache(page);
> > - lru = page_lru_base_type(page);
> > - del_page_from_lru_list(zone, page,
> > - lru + LRU_ACTIVE);
> > - ClearPageActive(page);
> > - ClearPageReferenced(page);
> > - add_page_to_lru_list(zone, page, lru);
> > - __count_vm_event(PGDEACTIVATE);
> > -
> > - update_page_reclaim_stat(zone, page, file, 0);
> > - }
> > - }
> > + __lru_deactivate(page, zone);
> > }
> > if (zone)
> > spin_unlock_irq(&zone->lru_lock);
> > @@ -336,11 +377,13 @@ static void drain_cpu_pagevecs(int cpu)
> >
> > pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> > if (pagevec_count(pvec))
> > - __pagevec_lru_deactive(pvec);
> > + __pagevec_lru_deactivate(pvec);
> > }
> >
> > /*
> > - * Forecfully demote a page to the tail of the inactive list.
> > + * Forcefully deactivate a page.
> > + * This function is used for reclaiming the page ASAP when the page
> > + * can't be invalidated by Dirty/Writeback.
> > */
> > void lru_deactivate_page(struct page *page)
> > {
> > @@ -348,12 +391,11 @@ void lru_deactivate_page(struct page *page)
> > struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> >
> > if (!pagevec_add(pvec, page))
> > - __pagevec_lru_deactive(pvec);
> > + __pagevec_lru_deactivate(pvec);
> > put_cpu_var(lru_deactivate_pvecs);
> > }
> > }
> >
> > -
>
> Unnecessary whitespace change there.

Thanks. Fixed.

>
> > void lru_add_drain(void)
> > {
> > drain_cpu_pagevecs(get_cpu());
>
> Functionally, I think this will work (although I'd like a clarification
> on why active pages are rotated). It'd be nice if there was a test case
> for this but it's a bit of a chicken-and-egg problem :/

I hope Ben gets good results.
Thanks.

>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

--
Kind regards,
Minchan Kim