LinuxLists.cc - [RFC 1/2] deactive invalidated pages

[permalink] [raw]

Subject: [RFC 2/2] Prevent promotion of page in madvise_dontneed

Now zap_pte_range alwayas promotes pages which are pte_young &&
!VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
it's unnecessary since the page wouldn't use any more.

If the page is sharred by other processes and it's real working set

Signed-off-by: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
---
include/linux/mm.h | 4 ++--
mm/madvise.c | 4 ++--
mm/memory.c | 9 ++++++---
mm/mmap.c | 4 ++--
4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..1555abe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -778,11 +778,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *);
+ unsigned long size, struct zap_details *, int promote);
unsigned long unmap_vmas(struct mmu_gather **tlb,
struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *);
+ struct zap_details *, int promote);

/**
* mm_walk - callbacks for walk_page_range
diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..247e5fd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -171,9 +171,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
- zap_page_range(vma, start, end - start, &details);
+ zap_page_range(vma, start, end - start, &details, 0);
} else
- zap_page_range(vma, start, end - start, NULL);
+ zap_page_range(vma, start, end - start, NULL, 0);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 02e48aa..276abdb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1075,6 +1075,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
* @end_addr: virtual address at which to end unmapping
* @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
* @details: details of nonlinear truncation or shared cache invalidation
+ * @promote: whether pages inclued vma would be promoted or not
*
* Returns the end address of the unmapping (restart addr if interrupted).
*
@@ -1096,7 +1097,7 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb,
unsigned long unmap_vmas(struct mmu_gather **tlbp,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
- struct zap_details *details)
+ struct zap_details *details, int promote)
{
long zap_work = ZAP_BLOCK_SIZE;
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
@@ -1184,9 +1185,10 @@ out:
* @address: starting address of pages to zap
* @size: number of bytes to zap
* @details: details of nonlinear truncation or shared cache invalidation
+ * @promote: whether the page would be promoted or not
*/
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *details)
+ unsigned long size, struct zap_details *details, int promote)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather *tlb;
@@ -1196,7 +1198,8 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+ end = unmap_vmas(&tlb, vma, address, end,
+ &nr_accounted, details, promote);
if (tlb)
tlb_finish_mmu(tlb, address, end);
return end;
diff --git a/mm/mmap.c b/mm/mmap.c
index b179abb..0d42c08 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1904,7 +1904,7 @@ static void unmap_region(struct mm_struct *mm,
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL, 1);
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
@@ -2278,7 +2278,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL, 1);
vm_unacct_memory(nr_accounted);

free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
--
1.7.0.4

2010-11-21 14:38:21

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Sun, Nov 21, 2010 at 11:30:24PM +0900, Minchan Kim wrote:
> If the page is sharred by other processes and it's real working set

Please ignore this last sentense. It's my mistake. I will fix it v2 after
some review.

--
Kind regards,
Minchan Kim

2010-11-21 15:21:43

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Sun, 21 Nov 2010 23:30:23 +0900, Minchan Kim <[email protected]> wrote:
>
> Ben, Remain thing is to modify rsync and use
> fadvise(POSIX_FADV_DONTNEED). Could you test it?

Thanks a ton for the patch. Looks good. Testing as we speak.

- Ben

2010-11-21 16:35:06

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Sun, 21 Nov 2010 23:30:24 +0900, Minchan Kim <[email protected]> wrote:
> Now zap_pte_range alwayas promotes pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
Is this not against master? If it is, I think you might have forgotten
to update the zap_page_range() reference on mm/memory.c:1226 (in
zap_vma_ptes()). Should promote be true or false in this case? Cheers,

- Ben

2010-11-22 00:31:11

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Mon, Nov 22, 2010 at 1:34 AM, Ben Gamari <[email protected]> wrote:
> On Sun, 21 Nov 2010 23:30:24 +0900, Minchan Kim <[email protected]> wrote:
>> Now zap_pte_range alwayas promotes pages which are pte_young &&
>> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
>> it's unnecessary since the page wouldn't use any more.
>>
> Is this not against master? If it is, I think you might have forgotten
> to update the zap_page_range() reference on mm/memory.c:1226 (in
> zap_vma_ptes()). Should promote be true or false in this case? Cheers,

Thanks. I missed that. Whatever, It's okay. :)
That's because it is used by only VM_PFNMAP.
It means the VMA doesn't have struct page descriptor of pages.
So zap_pte_range never promote the page.

Anyway, by semantic, it should be "zero".
Will fix.
Thanks, Ben.

>
> - Ben
>

--
Kind regards,
Minchan Kim

2010-11-22 01:18:25

by Rik van Riel

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On 11/21/2010 09:30 AM, Minchan Kim wrote:
> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)

This looks promising...

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-11-22 22:15:32

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Sun, 21 Nov 2010 23:30:23 +0900
Minchan Kim <[email protected]> wrote:

> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidate is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.
>
> If it is real working set, it could have a enough time to
> activate the page since we always try to keep many pages in
> inactive list.
>
> I reuse lru_demote of Peter with some change.
>
>
> ...
>
> +/*
> + * Function used to forecefully demote a page to the head of the inactive
> + * list.
> + */

This comment is wrong? The page gets moved to the _tail_ of the
inactive list?

> +void lru_deactive_page(struct page *page)

Should be "deactivate" throughout the patch. IMO.

> +{
> + if (likely(get_page_unless_zero(page))) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
> +
> + if (!pagevec_add(pvec, page))
> + __pagevec_lru_deactive(pvec);
> + put_cpu_var(lru_deactive_pvecs);
> + }
> }
>
> +
> void lru_add_drain(void)
> {
> drain_cpu_pagevecs(get_cpu());
> diff --git a/mm/truncate.c b/mm/truncate.c
> index cd94607..c73fb19 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> {
> struct pagevec pvec;
> pgoff_t next = start;
> - unsigned long ret = 0;
> + unsigned long ret;
> + unsigned long count = 0;
> int i;
>
> pagevec_init(&pvec, 0);
> @@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + if (!ret)
> + lru_deactive_page(page);

This is the core part of the patch and it needs a code comment to
explain the reasons for doing this.

I wonder about the page_mapped() case. We were unable to invalidate
the page because it was mapped into pagetables. But was it really
appropriate to deactivate the page in that case?

> + count += ret;
> unlock_page(page);
> if (next > end)
> break;

Suggested updates:

include/linux/swap.h | 2 +-
mm/swap.c | 13 ++++++-------
mm/truncate.c | 7 ++++++-
3 files changed, 13 insertions(+), 9 deletions(-)

diff -puN include/linux/swap.h~mm-deactivate-invalidated-pages-fix include/linux/swap.h
--- a/include/linux/swap.h~mm-deactivate-invalidated-pages-fix
+++ a/include/linux/swap.h
@@ -213,7 +213,7 @@ extern void mark_page_accessed(struct pa
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
-extern void lru_deactive_page(struct page *page);
+extern void lru_deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff -puN mm/swap.c~mm-deactivate-invalidated-pages-fix mm/swap.c
--- a/mm/swap.c~mm-deactivate-invalidated-pages-fix
+++ a/mm/swap.c
@@ -39,7 +39,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
@@ -334,23 +334,22 @@ static void drain_cpu_pagevecs(int cpu)
local_irq_restore(flags);
}

- pvec = &per_cpu(lru_deactive_pvecs, cpu);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
if (pagevec_count(pvec))
__pagevec_lru_deactive(pvec);
}

/*
- * Function used to forecefully demote a page to the head of the inactive
- * list.
+ * Forecfully demote a page to the tail of the inactive list.
*/
-void lru_deactive_page(struct page *page)
+void lru_deactivate_page(struct page *page)
{
if (likely(get_page_unless_zero(page))) {
- struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);

if (!pagevec_add(pvec, page))
__pagevec_lru_deactive(pvec);
- put_cpu_var(lru_deactive_pvecs);
+ put_cpu_var(lru_deactivate_pvecs);
}
}

diff -puN mm/truncate.c~mm-deactivate-invalidated-pages-fix mm/truncate.c
--- a/mm/truncate.c~mm-deactivate-invalidated-pages-fix
+++ a/mm/truncate.c
@@ -361,8 +361,13 @@ unsigned long invalidate_mapping_pages(s
continue;

ret = invalidate_inode_page(page);
+ /*
+ * If the page was dirty or under writeback we cannot
+ * invalidate it now. Move it to the tail of the
+ * inactive LRU so that reclaim will free it promptly.
+ */
if (!ret)
- lru_deactive_page(page);
+ lru_deactivate_page(page);
count += ret;
unlock_page(page);
if (next > end)
_

2010-11-22 22:22:05

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Sun, 21 Nov 2010 23:30:24 +0900
Minchan Kim <[email protected]> wrote:

> Now zap_pte_range alwayas promotes pages which are pte_young &&
> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> it's unnecessary since the page wouldn't use any more.
>
> If the page is sharred by other processes and it's real working set

This patch doesn't actually do anything. It passes variable `promote'
all the way down to unmap_vmas(), but unmap_vmas() doesn't use that new
variable.

Have a comment fixlet:

--- a/mm/memory.c~mm-prevent-promotion-of-page-in-madvise_dontneed-fix
+++ a/mm/memory.c
@@ -1075,7 +1075,7 @@ static unsigned long unmap_page_range(st
* @end_addr: virtual address at which to end unmapping
* @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
* @details: details of nonlinear truncation or shared cache invalidation
- * @promote: whether pages inclued vma would be promoted or not
+ * @promote: whether pages included in the vma should be promoted or not
*
* Returns the end address of the unmapping (restart addr if interrupted).
*
_

Also, I'd suggest that we avoid introducing the term "promote". It
isn't a term which is presently used in Linux MM. Probably "activate"
has a better-known meaning.

And `activate' could be a bool if one is in the mood for that.

2010-11-23 04:52:08

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 7:14 AM, Andrew Morton
<[email protected]> wrote:
> On Sun, 21 Nov 2010 23:30:23 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Recently, there are reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>> active list so that it results in working set page eviction.
>>
>> Some app developer want to support POSIX_FADV_NOREUSE.
>> But other OSes don't support it, either.
>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>
>> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It is very hard for application programmer to use it.
>> Because they always have to sync data before calling
>> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> be discardable. At last, they can't use deferred write of kernel
>> so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>>
>> In fact, invalidate is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head.
>>
>> If it is real working set, it could have a enough time to
>> activate the page since we always try to keep many pages in
>> inactive list.
>>
>> I reuse lru_demote of Peter with some change.
>>
>>
>> ...
>>
>> +/*
>> + * Function used to forecefully demote a page to the head of the inactive
>> + * list.
>> + */
>
> This comment is wrong? ?The page gets moved to the _tail_ of the
> inactive list?

No. I add it in _head_ of the inactive list intentionally.
Why I don't add it to _tail_ is that I don't want to be aggressive.
The page might be real working set. So I want to give a chance to
activate it again.
If it's not working set, it can be reclaimed easily and it can prevent
active page demotion since inactive list size would be big enough for
not calling shrink_active_list.

>
>> +void lru_deactive_page(struct page *page)
>
> Should be "deactivate" throughout the patch. IMO.

Thank you.

>
>> +{
>> + ? ? if (likely(get_page_unless_zero(page))) {
>> + ? ? ? ? ? ? struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
>> +
>> + ? ? ? ? ? ? if (!pagevec_add(pvec, page))
>> + ? ? ? ? ? ? ? ? ? ? __pagevec_lru_deactive(pvec);
>> + ? ? ? ? ? ? put_cpu_var(lru_deactive_pvecs);
>> + ? ? }
>> ?}
>>
>> +
>> ?void lru_add_drain(void)
>> ?{
>> ? ? ? drain_cpu_pagevecs(get_cpu());
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index cd94607..c73fb19 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> ?{
>> ? ? ? struct pagevec pvec;
>> ? ? ? pgoff_t next = start;
>> - ? ? unsigned long ret = 0;
>> + ? ? unsigned long ret;
>> + ? ? unsigned long count = 0;
>> ? ? ? int i;
>>
>> ? ? ? pagevec_init(&pvec, 0);
>> @@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? ? ? ? ? if (lock_failed)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>>
>> - ? ? ? ? ? ? ? ? ? ? ret += invalidate_inode_page(page);
>> -
>> + ? ? ? ? ? ? ? ? ? ? ret = invalidate_inode_page(page);
>> + ? ? ? ? ? ? ? ? ? ? if (!ret)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_deactive_page(page);
>
> This is the core part of the patch and it needs a code comment to
> explain the reasons for doing this.
>
> I wonder about the page_mapped() case. ?We were unable to invalidate
> the page because it was mapped into pagetables. ?But was it really
> appropriate to deactivate the page in that case?

Yes. My assumption is that if it's real working set, it could be
activated easily during inactive list moving window time.

>
>
>> + ? ? ? ? ? ? ? ? ? ? count += ret;
>> ? ? ? ? ? ? ? ? ? ? ? unlock_page(page);
>> ? ? ? ? ? ? ? ? ? ? ? if (next > end)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
>
> Suggested updates:
>
>
> ?include/linux/swap.h | ? ?2 +-
> ?mm/swap.c ? ? ? ? ? ?| ? 13 ++++++-------
> ?mm/truncate.c ? ? ? ?| ? ?7 ++++++-
> ?3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff -puN include/linux/swap.h~mm-deactivate-invalidated-pages-fix include/linux/swap.h
> --- a/include/linux/swap.h~mm-deactivate-invalidated-pages-fix
> +++ a/include/linux/swap.h
> @@ -213,7 +213,7 @@ extern void mark_page_accessed(struct pa
> ?extern void lru_add_drain(void);
> ?extern int lru_add_drain_all(void);
> ?extern void rotate_reclaimable_page(struct page *page);
> -extern void lru_deactive_page(struct page *page);
> +extern void lru_deactivate_page(struct page *page);
> ?extern void swap_setup(void);
>
> ?extern void add_page_to_unevictable_list(struct page *page);
> diff -puN mm/swap.c~mm-deactivate-invalidated-pages-fix mm/swap.c
> --- a/mm/swap.c~mm-deactivate-invalidated-pages-fix
> +++ a/mm/swap.c
> @@ -39,7 +39,7 @@ int page_cluster;
>
> ?static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> ?static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> -static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>
>
> ?/*
> @@ -334,23 +334,22 @@ static void drain_cpu_pagevecs(int cpu)
> ? ? ? ? ? ? ? ?local_irq_restore(flags);
> ? ? ? ?}
>
> - ? ? ? pvec = &per_cpu(lru_deactive_pvecs, cpu);
> + ? ? ? pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> ? ? ? ?if (pagevec_count(pvec))
> ? ? ? ? ? ? ? ?__pagevec_lru_deactive(pvec);
> ?}
>
> ?/*
> - * Function used to forecefully demote a page to the head of the inactive
> - * list.
> + * Forecfully demote a page to the tail of the inactive list.
> ?*/
> -void lru_deactive_page(struct page *page)
> +void lru_deactivate_page(struct page *page)
> ?{
> ? ? ? ?if (likely(get_page_unless_zero(page))) {
> - ? ? ? ? ? ? ? struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
> + ? ? ? ? ? ? ? struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
>
> ? ? ? ? ? ? ? ?if (!pagevec_add(pvec, page))
> ? ? ? ? ? ? ? ? ? ? ? ?__pagevec_lru_deactive(pvec);
> - ? ? ? ? ? ? ? put_cpu_var(lru_deactive_pvecs);
> + ? ? ? ? ? ? ? put_cpu_var(lru_deactivate_pvecs);
> ? ? ? ?}
> ?}
>
> diff -puN mm/truncate.c~mm-deactivate-invalidated-pages-fix mm/truncate.c
> --- a/mm/truncate.c~mm-deactivate-invalidated-pages-fix
> +++ a/mm/truncate.c
> @@ -361,8 +361,13 @@ unsigned long invalidate_mapping_pages(s
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>
> ? ? ? ? ? ? ? ? ? ? ? ?ret = invalidate_inode_page(page);
> + ? ? ? ? ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ? ? ? ? ?* If the page was dirty or under writeback we cannot
> + ? ? ? ? ? ? ? ? ? ? ? ?* invalidate it now. ?Move it to the tail of the
> + ? ? ? ? ? ? ? ? ? ? ? ?* inactive LRU so that reclaim will free it promptly.
> + ? ? ? ? ? ? ? ? ? ? ? ?*/
> ? ? ? ? ? ? ? ? ? ? ? ?if (!ret)
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_deactive_page(page);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_deactivate_page(page);
> ? ? ? ? ? ? ? ? ? ? ? ?count += ret;
> ? ? ? ? ? ? ? ? ? ? ? ?unlock_page(page);
> ? ? ? ? ? ? ? ? ? ? ? ?if (next > end)
> _
>
>

--
Kind regards,
Minchan Kim

2010-11-23 04:57:20

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Tue, Nov 23, 2010 at 7:21 AM, Andrew Morton
<[email protected]> wrote:
> On Sun, 21 Nov 2010 23:30:24 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Now zap_pte_range alwayas promotes pages which are pte_young &&
>> !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
>> it's unnecessary since the page wouldn't use any more.
>>
>> If the page is sharred by other processes and it's real working set
>
> This patch doesn't actually do anything. ?It passes variable `promote'
> all the way down to unmap_vmas(), but unmap_vmas() doesn't use that new
> variable.

Oops. Sorry for that.
I will resend the patch with your fixlet and Ben's point.
Thank you.

> Have a comment fixlet:
>
> --- a/mm/memory.c~mm-prevent-promotion-of-page-in-madvise_dontneed-fix
> +++ a/mm/memory.c
> @@ -1075,7 +1075,7 @@ static unsigned long unmap_page_range(st
> ?* @end_addr: virtual address at which to end unmapping
> ?* @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
> ?* @details: details of nonlinear truncation or shared cache invalidation
> - * @promote: whether pages inclued vma would be promoted or not
> + * @promote: whether pages included in the vma should be promoted or not
> ?*
> ?* Returns the end address of the unmapping (restart addr if interrupted).
> ?*
> _
>
> Also, I'd suggest that we avoid introducing the term "promote". ?It
> isn't a term which is presently used in Linux MM. ?Probably "activate"
> has a better-known meaning.
>
> And `activate' could be a bool if one is in the mood for that.
>

--
Kind regards,
Minchan Kim

2010-11-23 05:06:09

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:

> >> +/*
> >> + * Function used to forecefully demote a page to the head of the inactive
> >> + * list.
> >> + */
> >
> > This comment is wrong? __The page gets moved to the _tail_ of the
> > inactive list?
>
> No. I add it in _head_ of the inactive list intentionally.
> Why I don't add it to _tail_ is that I don't want to be aggressive.
> The page might be real working set. So I want to give a chance to
> activate it again.

Well.. why? The user just tried to toss the page away altogether. If
the kernel wasn't able to do that immediately, the best it can do is to
toss the page away asap?

> If it's not working set, it can be reclaimed easily and it can prevent
> active page demotion since inactive list size would be big enough for
> not calling shrink_active_list.

What is "working set"? Mapped and unmapped pagecache, or are you
referring solely to mapped pagecache?

If it's mapped pagecache then the user was being a bit silly (or didn't
know that some other process had mapped the file). In which case we
need to decide what to do - leave the page alone, deactivate it, or
half-deactivate it as this patch does.

2010-11-23 05:23:36

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
>
>> >> +/*
>> >> + * Function used to forecefully demote a page to the head of the inactive
>> >> + * list.
>> >> + */
>> >
>> > This comment is wrong? __The page gets moved to the _tail_ of the
>> > inactive list?
>>
>> No. I add it in _head_ of the inactive list intentionally.
>> Why I don't add it to _tail_ is that I don't want to be aggressive.
>> The page might be real working set. So I want to give a chance to
>> activate it again.
>
> Well.. ?why? ?The user just tried to toss the page away altogether. ?If
> the kernel wasn't able to do that immediately, the best it can do is to
> toss the page away asap?
>
>> If it's not working set, it can be reclaimed easily and it can prevent
>> active page demotion since inactive list size would be big enough for
>> not calling shrink_active_list.
>
> What is "working set"? ?Mapped and unmapped pagecache, or are you
> referring solely to mapped pagecache?

I mean it's mapped by other processes.

>
> If it's mapped pagecache then the user was being a bit silly (or didn't
> know that some other process had mapped the file). ?In which case we
> need to decide what to do - leave the page alone, deactivate it, or
> half-deactivate it as this patch does.

What I want is the half-deactivate.

Okay. We will use the result of invalidate_inode_page.
If fail happens by page_mapped, we can do half-deactivate.
But if fail happens by dirty(ex, writeback), we can add it to tail.
Does it make sense?

--
Kind regards,
Minchan Kim

2010-11-23 05:26:52

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:

> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
> <[email protected]> wrote:
> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
> >
> >> >> +/*
> >> >> + * Function used to forecefully demote a page to the head of the inactive
> >> >> + * list.
> >> >> + */
> >> >
> >> > This comment is wrong? __The page gets moved to the _tail_ of the
> >> > inactive list?
> >>
> >> No. I add it in _head_ of the inactive list intentionally.
> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
> >> The page might be real working set. So I want to give a chance to
> >> activate it again.
> >
> > Well.. __why? __The user just tried to toss the page away altogether. __If
> > the kernel wasn't able to do that immediately, the best it can do is to
> > toss the page away asap?
> >
> >> If it's not working set, it can be reclaimed easily and it can prevent
> >> active page demotion since inactive list size would be big enough for
> >> not calling shrink_active_list.
> >
> > What is "working set"? __Mapped and unmapped pagecache, or are you
> > referring solely to mapped pagecache?
>
> I mean it's mapped by other processes.
>
> >
> > If it's mapped pagecache then the user was being a bit silly (or didn't
> > know that some other process had mapped the file). __In which case we
> > need to decide what to do - leave the page alone, deactivate it, or
> > half-deactivate it as this patch does.
>
>
> What I want is the half-deactivate.
>
> Okay. We will use the result of invalidate_inode_page.
> If fail happens by page_mapped, we can do half-deactivate.
> But if fail happens by dirty(ex, writeback), we can add it to tail.
> Does it make sense?

Spose so. It's unobvious.

If the page is dirty or under writeback then reclaim will immediately
move it to the head of the LRU anyway. But given that the user has
just freed a bunch of pages with invalidate(), it's unlikely that
reclaim will be running soon.

2010-11-23 05:45:18

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 2:22 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:
>
>> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
>> <[email protected]> wrote:
>> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
>> >
>> >> >> +/*
>> >> >> + * Function used to forecefully demote a page to the head of the inactive
>> >> >> + * list.
>> >> >> + */
>> >> >
>> >> > This comment is wrong? __The page gets moved to the _tail_ of the
>> >> > inactive list?
>> >>
>> >> No. I add it in _head_ of the inactive list intentionally.
>> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
>> >> The page might be real working set. So I want to give a chance to
>> >> activate it again.
>> >
>> > Well.. __why? __The user just tried to toss the page away altogether. __If
>> > the kernel wasn't able to do that immediately, the best it can do is to
>> > toss the page away asap?
>> >
>> >> If it's not working set, it can be reclaimed easily and it can prevent
>> >> active page demotion since inactive list size would be big enough for
>> >> not calling shrink_active_list.
>> >
>> > What is "working set"? __Mapped and unmapped pagecache, or are you
>> > referring solely to mapped pagecache?
>>
>> I mean it's mapped by other processes.
>>
>> >
>> > If it's mapped pagecache then the user was being a bit silly (or didn't
>> > know that some other process had mapped the file). __In which case we
>> > need to decide what to do - leave the page alone, deactivate it, or
>> > half-deactivate it as this patch does.
>>
>>
>> What I want is the half-deactivate.
>>
>> Okay. We will use the result of invalidate_inode_page.
>> If fail happens by page_mapped, we can do half-deactivate.
>> But if fail happens by dirty(ex, writeback), we can add it to tail.
>> Does it make sense?
>
> Spose so. ?It's unobvious.
>
> If the page is dirty or under writeback then reclaim will immediately
> move it to the head of the LRU anyway. ?But given that the user has

Why does it move into head of LRU?
If the page which isn't mapped doesn't have PG_referenced, it would be
reclaimed.

> just freed a bunch of pages with invalidate(), it's unlikely that
> reclaim will be running soon.

If reclaim doesn't start soon, it's good. That's because we have a
time to activate it and
when reclaim happens, reclaimer can reclaim pages easily.

If I don't understand your point, could you elaborate on it?

--
Kind regards,
Minchan Kim

2010-11-23 05:53:20

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 14:45:15 +0900 Minchan Kim <[email protected]> wrote:

> On Tue, Nov 23, 2010 at 2:22 PM, Andrew Morton
> <[email protected]> wrote:
> > On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:
> >
> >> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
> >> <[email protected]> wrote:
> >> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
> >> >
> >> >> >> +/*
> >> >> >> + * Function used to forecefully demote a page to the head of the inactive
> >> >> >> + * list.
> >> >> >> + */
> >> >> >
> >> >> > This comment is wrong? __The page gets moved to the _tail_ of the
> >> >> > inactive list?
> >> >>
> >> >> No. I add it in _head_ of the inactive list intentionally.
> >> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
> >> >> The page might be real working set. So I want to give a chance to
> >> >> activate it again.
> >> >
> >> > Well.. __why? __The user just tried to toss the page away altogether. __If
> >> > the kernel wasn't able to do that immediately, the best it can do is to
> >> > toss the page away asap?
> >> >
> >> >> If it's not working set, it can be reclaimed easily and it can prevent
> >> >> active page demotion since inactive list size would be big enough for
> >> >> not calling shrink_active_list.
> >> >
> >> > What is "working set"? __Mapped and unmapped pagecache, or are you
> >> > referring solely to mapped pagecache?
> >>
> >> I mean it's mapped by other processes.
> >>
> >> >
> >> > If it's mapped pagecache then the user was being a bit silly (or didn't
> >> > know that some other process had mapped the file). __In which case we
> >> > need to decide what to do - leave the page alone, deactivate it, or
> >> > half-deactivate it as this patch does.
> >>
> >>
> >> What I want is the half-deactivate.
> >>
> >> Okay. We will use the result of invalidate_inode_page.
> >> If fail happens by page_mapped, we can do half-deactivate.
> >> But if fail happens by dirty(ex, writeback), we can add it to tail.
> >> Does it make sense?
> >
> > Spose so. __It's unobvious.
> >
> > If the page is dirty or under writeback then reclaim will immediately
> > move it to the head of the LRU anyway. __But given that the user has
>
> Why does it move into head of LRU?
> If the page which isn't mapped doesn't have PG_referenced, it would be
> reclaimed.

If it's dirty or under writeback it can't be reclaimed!

> > just freed a bunch of pages with invalidate(), it's unlikely that
> > reclaim will be running soon.
>
> If reclaim doesn't start soon, it's good. That's because we have a
> time to activate it and
> when reclaim happens, reclaimer can reclaim pages easily.
>
> If I don't understand your point, could you elaborate on it?

If reclaim doesn't happen soon and the page was dirty or under
writeback (and hence unreclaimable) then there's a better chance that
it _will_ be reclaimable by the time reclaim comes along and has a look
at it. Yes, that's good.

And a note to Mel: this is one way in which we can get significant
(perhaps tremendous) numbers of dirty pages coming off the tail of the
LRU, and hence eligible for pageout() treatment.

2010-11-23 06:05:41

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 2:48 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 23 Nov 2010 14:45:15 +0900 Minchan Kim <[email protected]> wrote:
>
>> On Tue, Nov 23, 2010 at 2:22 PM, Andrew Morton
>> <[email protected]> wrote:
>> > On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:
>> >
>> >> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
>> >> <[email protected]> wrote:
>> >> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
>> >> >
>> >> >> >> +/*
>> >> >> >> + * Function used to forecefully demote a page to the head of the inactive
>> >> >> >> + * list.
>> >> >> >> + */
>> >> >> >
>> >> >> > This comment is wrong? __The page gets moved to the _tail_ of the
>> >> >> > inactive list?
>> >> >>
>> >> >> No. I add it in _head_ of the inactive list intentionally.
>> >> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
>> >> >> The page might be real working set. So I want to give a chance to
>> >> >> activate it again.
>> >> >
>> >> > Well.. __why? __The user just tried to toss the page away altogether. __If
>> >> > the kernel wasn't able to do that immediately, the best it can do is to
>> >> > toss the page away asap?
>> >> >
>> >> >> If it's not working set, it can be reclaimed easily and it can prevent
>> >> >> active page demotion since inactive list size would be big enough for
>> >> >> not calling shrink_active_list.
>> >> >
>> >> > What is "working set"? __Mapped and unmapped pagecache, or are you
>> >> > referring solely to mapped pagecache?
>> >>
>> >> I mean it's mapped by other processes.
>> >>
>> >> >
>> >> > If it's mapped pagecache then the user was being a bit silly (or didn't
>> >> > know that some other process had mapped the file). __In which case we
>> >> > need to decide what to do - leave the page alone, deactivate it, or
>> >> > half-deactivate it as this patch does.
>> >>
>> >>
>> >> What I want is the half-deactivate.
>> >>
>> >> Okay. We will use the result of invalidate_inode_page.
>> >> If fail happens by page_mapped, we can do half-deactivate.
>> >> But if fail happens by dirty(ex, writeback), we can add it to tail.
>> >> Does it make sense?
>> >
>> > Spose so. __It's unobvious.
>> >
>> > If the page is dirty or under writeback then reclaim will immediately
>> > move it to the head of the LRU anyway. __But given that the user has
>>
>> Why does it move into head of LRU?
>> If the page which isn't mapped doesn't have PG_referenced, it would be
>> reclaimed.
>
> If it's dirty or under writeback it can't be reclaimed!

I see your point. And it's why I add it to head of inactive list.

>
>> > just freed a bunch of pages with invalidate(), it's unlikely that
>> > reclaim will be running soon.
>>
>> If reclaim doesn't start soon, it's good. That's because we have a
>> time to activate it and
>> when reclaim happens, reclaimer can reclaim pages easily.
>>
>> If I don't understand your point, could you elaborate on it?
>
> If reclaim doesn't happen soon and the page was dirty or under
> writeback (and hence unreclaimable) then there's a better chance that
> it _will_ be reclaimable by the time reclaim comes along and has a look
> at it. ?Yes, that's good.
>
> And a note to Mel: this is one way in which we can get significant
> (perhaps tremendous) numbers of dirty pages coming off the tail of the
> LRU, and hence eligible for pageout() treatment.
>

I think you agree my as-is approach. Right?
Then, could you revert your fixlet?
I will add additional description with your fix.

--
Kind regards,
Minchan Kim

2010-11-23 07:17:00

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)

If rsync use the above url patch, we don't need your patch.
fdatasync() + POSIX_FADV_DONTNEED should work fine.

So, I think the core worth of previous PeterZ's patch is in readahead
based heuristics. I'm curious why you drop it.

> In fact, invalidate is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.

But, I agree this.

>
> If it is real working set, it could have a enough time to
> activate the page since we always try to keep many pages in
> inactive list.
>
> I reuse lru_demote of Peter with some change.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
>
> Ben, Remain thing is to modify rsync and use
> fadvise(POSIX_FADV_DONTNEED). Could you test it?
> ---
> include/linux/swap.h | 1 +
> mm/swap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 11 +++++---
> 3 files changed, 69 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..a3c9248 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> extern int lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> +extern void lru_deactive_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/swap.c b/mm/swap.c
> index 3f48542..56fa298 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,8 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
> +
>
> /*
> * This path almost never happens for VM activity - pages are normally
> @@ -266,6 +268,45 @@ void add_page_to_unevictable_list(struct page *page)
> spin_unlock_irq(&zone->lru_lock);
> }
>
> +static void __pagevec_lru_deactive(struct pagevec *pvec)
> +{
> + int i, lru, file;
> +
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> +
> + if (PageLRU(page)) {
> + if (PageActive(page)) {
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page,
> + lru + LRU_ACTIVE);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + __count_vm_event(PGDEACTIVATE);
> +
> + update_page_reclaim_stat(zone, page, file, 0);

When PageActive is unset, we need to change cgroup lru too.

> + }
> + }
> + }
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> +
> + release_pages(pvec->pages, pvec->nr, pvec->cold);
> + pagevec_reinit(pvec);
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
> pagevec_move_tail(pvec);
> local_irq_restore(flags);
> }
> +
> + pvec = &per_cpu(lru_deactive_pvecs, cpu);
> + if (pagevec_count(pvec))
> + __pagevec_lru_deactive(pvec);
> +}
> +
> +/*
> + * Function used to forecefully demote a page to the head of the inactive
> + * list.
> + */
> +void lru_deactive_page(struct page *page)
> +{
> + if (likely(get_page_unless_zero(page))) {

Probably, we can check PageLRU and PageActive here too. It help to avoid
unnecessary batching and may slightly increase performance.

> + struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
> +
> + if (!pagevec_add(pvec, page))
> + __pagevec_lru_deactive(pvec);
> + put_cpu_var(lru_deactive_pvecs);
> + }
> }
>
> +
> void lru_add_drain(void)
> {
> drain_cpu_pagevecs(get_cpu());
> diff --git a/mm/truncate.c b/mm/truncate.c
> index cd94607..c73fb19 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> {
> struct pagevec pvec;
> pgoff_t next = start;
> - unsigned long ret = 0;
> + unsigned long ret;
> + unsigned long count = 0;
> int i;
>
> pagevec_init(&pvec, 0);
> @@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + if (!ret)
> + lru_deactive_page(page);
> + count += ret;
> unlock_page(page);
> if (next > end)
> break;
> @@ -369,7 +372,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> mem_cgroup_uncharge_end();
> cond_resched();
> }
> - return ret;
> + return count;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);
>
> --
> 1.7.0.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-11-23 07:16:59

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

> On Sun, 21 Nov 2010 23:30:23 +0900, Minchan Kim <[email protected]> wrote:
> >
> > Ben, Remain thing is to modify rsync and use
> > fadvise(POSIX_FADV_DONTNEED). Could you test it?
>
> Thanks a ton for the patch. Looks good. Testing as we speak.

If possible, can you please post your rsync patch and your testcase
(or your rsync option + system memory size info + data size info)?

2010-11-23 07:28:33

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 15:05:39 +0900 Minchan Kim <[email protected]> wrote:

> On Tue, Nov 23, 2010 at 2:48 PM, Andrew Morton
> >> > move it to the head of the LRU anyway. __But given that the user has
> >>
> >> Why does it move into head of LRU?
> >> If the page which isn't mapped doesn't have PG_referenced, it would be
> >> reclaimed.
> >
> > If it's dirty or under writeback it can't be reclaimed!
>
> I see your point. And it's why I add it to head of inactive list.

But that *guarantees* that the page will get a full trip around the
inactive list. And this will guarantee that potentially useful pages
are reclaimed before the pages which we *know* the user doesn't want!
Bad!

Whereas if we queue it to the tail, it will only get that full trip if
reclaim happens to run before the page is cleaned. And we just agreed
that reclaim isn't likely to run immediately, because pages are being
freed.

So we face a choice between guaranteed eviction of potentially-useful
pages (which are very expensive to reestablish) versus a *possible*
need to move an unreclaimable page to the head of the LRU, which is
cheap.

So we need to decide if

(certain * possibly-expensive) is less than (possible * cheap).

So no, it's not obvious to me that this was the correct decision.

One way to help with this guess would be to instrument those pages
(would require adding a special page flag to identify them, I expect)
and see how many of these pages get immediately reclaimed off the LRU
versus how many get recycled. And then run a batch of "typical
workloads" under that instrumentation.

Only there aren't any "typical workloads" for fadvise(DONTNEED) :(

2010-11-23 07:40:08

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

Hi KOSAKI,

2010/11/23 KOSAKI Motohiro <[email protected]>:
>> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It is very hard for application programmer to use it.
>> Because they always have to sync data before calling
>> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> be discardable. At last, they can't use deferred write of kernel
>> so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>
> If rsync use the above url patch, we don't need your patch.
> fdatasync() + POSIX_FADV_DONTNEED should work fine.

It works well. But it needs always fdatasync before calling fadvise.
For small file, it hurt performance since we can't use the deferred write.

>
> So, I think the core worth of previous PeterZ's patch is in readahead
> based heuristics. I'm curious why you drop it.
>

In previous peter's patch, it couldn't move active page into inactive list.
So it's not what i want and I think invalidation is stronger hint than
the readahead heuristic.
But if we need it, I will add it in my series. It can help reclaiming
unnecessary inactive page asap.
but before that, I hope we make sure fadvise works well enough.

>
>> In fact, invalidate is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head.
>
> But, I agree this.

Thank you.

>
>
>>
>> If it is real working set, it could have a enough time to
>> activate the page since we always try to keep many pages in
>> inactive list.
>>
>> I reuse lru_demote of Peter with some change.
>>
>> Reported-by: Ben Gamari <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>>
>> Ben, Remain thing is to modify rsync and use
>> fadvise(POSIX_FADV_DONTNEED). Could you test it?
>> ---
>> ?include/linux/swap.h | ? ?1 +
>> ?mm/swap.c ? ? ? ? ? ?| ? 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> ?mm/truncate.c ? ? ? ?| ? 11 +++++---
>> ?3 files changed, 69 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..a3c9248 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
>> ?extern void lru_add_drain(void);
>> ?extern int lru_add_drain_all(void);
>> ?extern void rotate_reclaimable_page(struct page *page);
>> +extern void lru_deactive_page(struct page *page);
>> ?extern void swap_setup(void);
>>
>> ?extern void add_page_to_unevictable_list(struct page *page);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 3f48542..56fa298 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -39,6 +39,8 @@ int page_cluster;
>>
>> ?static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>> ?static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>> +static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
>> +
>>
>> ?/*
>> ? * This path almost never happens for VM activity - pages are normally
>> @@ -266,6 +268,45 @@ void add_page_to_unevictable_list(struct page *page)
>> ? ? ? spin_unlock_irq(&zone->lru_lock);
>> ?}
>>
>> +static void __pagevec_lru_deactive(struct pagevec *pvec)
>> +{
>> + ? ? int i, lru, file;
>> +
>> + ? ? struct zone *zone = NULL;
>> +
>> + ? ? for (i = 0; i < pagevec_count(pvec); i++) {
>> + ? ? ? ? ? ? struct page *page = pvec->pages[i];
>> + ? ? ? ? ? ? struct zone *pagezone = page_zone(page);
>> +
>> + ? ? ? ? ? ? if (pagezone != zone) {
>> + ? ? ? ? ? ? ? ? ? ? if (zone)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
>> + ? ? ? ? ? ? ? ? ? ? zone = pagezone;
>> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&zone->lru_lock);
>> + ? ? ? ? ? ? }
>> +
>> + ? ? ? ? ? ? if (PageLRU(page)) {
>> + ? ? ? ? ? ? ? ? ? ? if (PageActive(page)) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? file = page_is_file_cache(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru = page_lru_base_type(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? del_page_from_lru_list(zone, page,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru + LRU_ACTIVE);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageActive(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageReferenced(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? add_page_to_lru_list(zone, page, lru);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? __count_vm_event(PGDEACTIVATE);
>> +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? update_page_reclaim_stat(zone, page, file, 0);
>
> When PageActive is unset, we need to change cgroup lru too.

Doesn't add_page_to_lru_list/del_page_from_lru_list do it?

>
>
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? }
>> + ? ? }
>> + ? ? if (zone)
>> + ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
>> +
>> + ? ? release_pages(pvec->pages, pvec->nr, pvec->cold);
>> + ? ? pagevec_reinit(pvec);
>> +}
>> +
>> ?/*
>> ? * Drain pages out of the cpu's pagevecs.
>> ? * Either "cpu" is the current CPU, and preemption has already been
>> @@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
>> ? ? ? ? ? ? ? pagevec_move_tail(pvec);
>> ? ? ? ? ? ? ? local_irq_restore(flags);
>> ? ? ? }
>> +
>> + ? ? pvec = &per_cpu(lru_deactive_pvecs, cpu);
>> + ? ? if (pagevec_count(pvec))
>> + ? ? ? ? ? ? __pagevec_lru_deactive(pvec);
>> +}
>> +
>> +/*
>> + * Function used to forecefully demote a page to the head of the inactive
>> + * list.
>> + */
>> +void lru_deactive_page(struct page *page)
>> +{
>> + ? ? if (likely(get_page_unless_zero(page))) {
>
> Probably, we can check PageLRU and PageActive here too. It help to avoid
> unnecessary batching and may slightly increase performance.

Yes. Thanks. Will fix.

>
>
>> + ? ? ? ? ? ? struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
>> +
>> + ? ? ? ? ? ? if (!pagevec_add(pvec, page))
>> + ? ? ? ? ? ? ? ? ? ? __pagevec_lru_deactive(pvec);
>> + ? ? ? ? ? ? put_cpu_var(lru_deactive_pvecs);
>> + ? ? }
>> ?}
>>
>> +
>> ?void lru_add_drain(void)
>> ?{
>> ? ? ? drain_cpu_pagevecs(get_cpu());
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index cd94607..c73fb19 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> ?{
>> ? ? ? struct pagevec pvec;
>> ? ? ? pgoff_t next = start;
>> - ? ? unsigned long ret = 0;
>> + ? ? unsigned long ret;
>> + ? ? unsigned long count = 0;
>> ? ? ? int i;
>>
>> ? ? ? pagevec_init(&pvec, 0);
>> @@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? ? ? ? ? if (lock_failed)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>>
>> - ? ? ? ? ? ? ? ? ? ? ret += invalidate_inode_page(page);
>> -
>> + ? ? ? ? ? ? ? ? ? ? ret = invalidate_inode_page(page);
>> + ? ? ? ? ? ? ? ? ? ? if (!ret)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_deactive_page(page);
>> + ? ? ? ? ? ? ? ? ? ? count += ret;
>> ? ? ? ? ? ? ? ? ? ? ? unlock_page(page);
>> ? ? ? ? ? ? ? ? ? ? ? if (next > end)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
>> @@ -369,7 +372,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? mem_cgroup_uncharge_end();
>> ? ? ? ? ? ? ? cond_resched();
>> ? ? ? }
>> - ? ? return ret;
>> + ? ? return count;
>> ?}
>> ?EXPORT_SYMBOL(invalidate_mapping_pages);
>>
>> --
>> 1.7.0.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at ?http://www.tux.org/lkml/
>
>
>
>

--
Kind regards,
Minchan Kim

2010-11-23 07:44:52

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 4:15 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 23 Nov 2010 15:05:39 +0900 Minchan Kim <[email protected]> wrote:
>
>> On Tue, Nov 23, 2010 at 2:48 PM, Andrew Morton
>> >> > move it to the head of the LRU anyway. __But given that the user has
>> >>
>> >> Why does it move into head of LRU?
>> >> If the page which isn't mapped doesn't have PG_referenced, it would be
>> >> reclaimed.
>> >
>> > If it's dirty or under writeback it can't be reclaimed!
>>
>> I see your point. And it's why I add it to head of inactive list.
>
> But that *guarantees* that the page will get a full trip around the
> inactive list. ?And this will guarantee that potentially useful pages
> are reclaimed before the pages which we *know* the user doesn't want!
> Bad!
>
> Whereas if we queue it to the tail, it will only get that full trip if
> reclaim happens to run before the page is cleaned. ?And we just agreed
> that reclaim isn't likely to run immediately, because pages are being
> freed.
>
> So we face a choice between guaranteed eviction of potentially-useful
> pages (which are very expensive to reestablish) versus a *possible*
> need to move an unreclaimable page to the head of the LRU, which is
> cheap.

How about flagging SetPageReclaim when we add it to head of inactive?
If page write is complete, end_page_writeback would move it to tail of
inactive.

--
Kind regards,
Minchan Kim

2010-11-23 07:48:03

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 16:40:03 +0900 Minchan Kim <[email protected]> wrote:

> Hi KOSAKI,
>
> 2010/11/23 KOSAKI Motohiro <[email protected]>:
> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> >> But it has a problem. If kernel meets page is writing
> >> during invalidate_mapping_pages, it can't work.
> >> It is very hard for application programmer to use it.
> >> Because they always have to sync data before calling
> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> >> be discardable. At last, they can't use deferred write of kernel
> >> so that they could see performance loss.
> >> (http://insights.oetiker.ch/linux/fadvise.html)
> >
> > If rsync use the above url patch, we don't need your patch.
> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
>
> It works well. But it needs always fdatasync before calling fadvise.
> For small file, it hurt performance since we can't use the deferred write.

fdatasync() is (much) better than nothing, but a userspace application
which is carefully managing its IO scheduling should use
sync_file_range(SYNC_FILE_RANGE_WRITE) to push data at the disk and
should then run fadvise(DONTNEED) against the same data a few seconds
later, after the IO has completed.

That way, the application won't block against the write I/O at all,
unless of course someone else is thrashing the disk as well, etc.

If the app is doing a lot of file I/O (eg, rsync) then this shouldn't
be too hard to arrange. Although the payback will be pretty small
unless the IO-intensive process is also compute-intensive at times.
And such applications are a) fairly rare and b) poorly designed:
shouldn't be doing heavy IO and heavy compute in the same thread!

2010-11-23 07:58:30

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 16:44:50 +0900 Minchan Kim <[email protected]> wrote:

> On Tue, Nov 23, 2010 at 4:15 PM, Andrew Morton
> <[email protected]> wrote:
> > On Tue, 23 Nov 2010 15:05:39 +0900 Minchan Kim <[email protected]> wrote:
> >
> >> On Tue, Nov 23, 2010 at 2:48 PM, Andrew Morton
> >> >> > move it to the head of the LRU anyway. __But given that the user has
> >> >>
> >> >> Why does it move into head of LRU?
> >> >> If the page which isn't mapped doesn't have PG_referenced, it would be
> >> >> reclaimed.
> >> >
> >> > If it's dirty or under writeback it can't be reclaimed!
> >>
> >> I see your point. And it's why I add it to head of inactive list.
> >
> > But that *guarantees* that the page will get a full trip around the
> > inactive list. __And this will guarantee that potentially useful pages
> > are reclaimed before the pages which we *know* the user doesn't want!
> > Bad!
> >
> > Whereas if we queue it to the tail, it will only get that full trip if
> > reclaim happens to run before the page is cleaned. __And we just agreed
> > that reclaim isn't likely to run immediately, because pages are being
> > freed.
> >
> > So we face a choice between guaranteed eviction of potentially-useful
> > pages (which are very expensive to reestablish) versus a *possible*
> > need to move an unreclaimable page to the head of the LRU, which is
> > cheap.
>
> How about flagging SetPageReclaim when we add it to head of inactive?
> If page write is complete, end_page_writeback would move it to tail of
> inactive.

ooh, that sounds clever. We'd want to do that for both PageDirty() and
for PageWriteback() pages.

But if we do it for PageDirty() pages, we'd need to clear PageReclaim()
if someone reuses the page for some reason. We'll end up with pages
all over the place which have PageReclaim set. I guess we could clear
PageReclaim() in mark_page_accessed(), but that's hardly going to give
us full coverage.

hmm. Maybe just do it for PageWriteback pages. Then userspace can do

sync_file_range(SYNC_FILE_RANGE_WRITE);
fadvise(DONTNEED);

and all those pages which now have PageWriteback set will also get
PageReclaim set.

But we'd need to avoid races against end_io when setting PageReclaim
against the PageWriteback pages - if the interrupt happens while we're
setting PageReclaim, it will end up being incorrectly set.

2010-11-23 08:01:18

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

> Hi KOSAKI,
>
> 2010/11/23 KOSAKI Motohiro <[email protected]>:
> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> >> But it has a problem. If kernel meets page is writing
> >> during invalidate_mapping_pages, it can't work.
> >> It is very hard for application programmer to use it.
> >> Because they always have to sync data before calling
> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> >> be discardable. At last, they can't use deferred write of kernel
> >> so that they could see performance loss.
> >> (http://insights.oetiker.ch/linux/fadvise.html)
> >
> > If rsync use the above url patch, we don't need your patch.
> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
>
> It works well. But it needs always fdatasync before calling fadvise.
> For small file, it hurt performance since we can't use the deferred write.

I doubt rsync need to call fdatasync. Why?

If rsync continue to do following loop, some POSIX_FADV_DONTNEED
may not drop some dirty pages. But they can be dropped at next loop's
POSIX_FADV_DONTNEED. Then, It doesn't make serious issue.

1) read
2) write
3) POSIX_FADV_DONTNEED
4) goto 1

Am I missing anything?

> > So, I think the core worth of previous PeterZ's patch is in readahead
> > based heuristics. I'm curious why you drop it.
> >
>
> In previous peter's patch, it couldn't move active page into inactive list.
> So it's not what i want and I think invalidation is stronger hint than
> the readahead heuristic.
> But if we need it, I will add it in my series. It can help reclaiming
> unnecessary inactive page asap.
> but before that, I hope we make sure fadvise works well enough.

I've got it.Yeah, 1) implement manual oepration 2) add automatic heuristic
is right order. I think. we can easily test your one.

> >> In fact, invalidate is very big hint to reclaimer.
> >> It means we don't use the page any more. So let's move
> >> the writing page into inactive list's head.
> >
> > But, I agree this.
>
> Thank you.
>
> >> +static void __pagevec_lru_deactive(struct pagevec *pvec)
> >> +{
> >> + ? ? int i, lru, file;
> >> +
> >> + ? ? struct zone *zone = NULL;
> >> +
> >> + ? ? for (i = 0; i < pagevec_count(pvec); i++) {
> >> + ? ? ? ? ? ? struct page *page = pvec->pages[i];
> >> + ? ? ? ? ? ? struct zone *pagezone = page_zone(page);
> >> +
> >> + ? ? ? ? ? ? if (pagezone != zone) {
> >> + ? ? ? ? ? ? ? ? ? ? if (zone)
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
> >> + ? ? ? ? ? ? ? ? ? ? zone = pagezone;
> >> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&zone->lru_lock);
> >> + ? ? ? ? ? ? }
> >> +
> >> + ? ? ? ? ? ? if (PageLRU(page)) {
> >> + ? ? ? ? ? ? ? ? ? ? if (PageActive(page)) {
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? file = page_is_file_cache(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru = page_lru_base_type(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? del_page_from_lru_list(zone, page,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru + LRU_ACTIVE);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageActive(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageReferenced(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? add_page_to_lru_list(zone, page, lru);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? __count_vm_event(PGDEACTIVATE);
> >> +
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? update_page_reclaim_stat(zone, page, file, 0);
> >
> > When PageActive is unset, we need to change cgroup lru too.
>
> Doesn't add_page_to_lru_list/del_page_from_lru_list do it?

Grr, my fault. I've forgot to we changed add_page_to_lru_list.

2010-11-23 08:02:21

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 4:53 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 23 Nov 2010 16:44:50 +0900 Minchan Kim <[email protected]> wrote:
>
>> On Tue, Nov 23, 2010 at 4:15 PM, Andrew Morton
>> <[email protected]> wrote:
>> > On Tue, 23 Nov 2010 15:05:39 +0900 Minchan Kim <[email protected]> wrote:
>> >
>> >> On Tue, Nov 23, 2010 at 2:48 PM, Andrew Morton
>> >> >> > move it to the head of the LRU anyway. __But given that the user has
>> >> >>
>> >> >> Why does it move into head of LRU?
>> >> >> If the page which isn't mapped doesn't have PG_referenced, it would be
>> >> >> reclaimed.
>> >> >
>> >> > If it's dirty or under writeback it can't be reclaimed!
>> >>
>> >> I see your point. And it's why I add it to head of inactive list.
>> >
>> > But that *guarantees* that the page will get a full trip around the
>> > inactive list. __And this will guarantee that potentially useful pages
>> > are reclaimed before the pages which we *know* the user doesn't want!
>> > Bad!
>> >
>> > Whereas if we queue it to the tail, it will only get that full trip if
>> > reclaim happens to run before the page is cleaned. __And we just agreed
>> > that reclaim isn't likely to run immediately, because pages are being
>> > freed.
>> >
>> > So we face a choice between guaranteed eviction of potentially-useful
>> > pages (which are very expensive to reestablish) versus a *possible*
>> > need to move an unreclaimable page to the head of the LRU, which is
>> > cheap.
>>
>> How about flagging SetPageReclaim when we add it to head of inactive?
>> If page write is complete, end_page_writeback would move it to tail of
>> inactive.
>
> ooh, that sounds clever. ?We'd want to do that for both PageDirty() and
> for PageWriteback() pages.
>
> But if we do it for PageDirty() pages, we'd need to clear PageReclaim()
> if someone reuses the page for some reason. ?We'll end up with pages
> all over the place which have PageReclaim set. ?I guess we could clear
> PageReclaim() in mark_page_accessed(), but that's hardly going to give
> us full coverage.
>
> hmm. ?Maybe just do it for PageWriteback pages. ?Then userspace can do
>
> ? ? ? ?sync_file_range(SYNC_FILE_RANGE_WRITE);
> ? ? ? ?fadvise(DONTNEED);
>
> and all those pages which now have PageWriteback set will also get
> PageReclaim set.
>
> But we'd need to avoid races against end_io when setting PageReclaim
> against the PageWriteback pages - if the interrupt happens while we're
> setting PageReclaim, it will end up being incorrectly set.
>

Okay. I will see it and resend new version.
Thanks for good comment, Andrew.

--
Kind regards,
Minchan Kim

2010-11-23 08:45:00

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 5:01 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Hi KOSAKI,
>>
>> 2010/11/23 KOSAKI Motohiro <[email protected]>:
>> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>> >> But it has a problem. If kernel meets page is writing
>> >> during invalidate_mapping_pages, it can't work.
>> >> It is very hard for application programmer to use it.
>> >> Because they always have to sync data before calling
>> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> >> be discardable. At last, they can't use deferred write of kernel
>> >> so that they could see performance loss.
>> >> (http://insights.oetiker.ch/linux/fadvise.html)
>> >
>> > If rsync use the above url patch, we don't need your patch.
>> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
>>
>> It works well. But it needs always fdatasync before calling fadvise.
>> For small file, it hurt performance since we can't use the deferred write.
>
> I doubt rsync need to call fdatasync. Why?
>
> If rsync continue to do following loop, some POSIX_FADV_DONTNEED
> may not drop some dirty pages. But they can be dropped at next loop's
> POSIX_FADV_DONTNEED. Then, It doesn't make serious issue.
>
> 1) read
> 2) write
> 3) POSIX_FADV_DONTNEED
> 4) goto 1

fadvise need pair (offset and len).
if the pair in next turn is different with one's previous turn, it
couldn't be dropped.

--
Kind regards,
Minchan Kim

2010-11-23 09:02:06

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

> On Tue, Nov 23, 2010 at 5:01 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> Hi KOSAKI,
> >>
> >> 2010/11/23 KOSAKI Motohiro <[email protected]>:
> >> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> >> >> But it has a problem. If kernel meets page is writing
> >> >> during invalidate_mapping_pages, it can't work.
> >> >> It is very hard for application programmer to use it.
> >> >> Because they always have to sync data before calling
> >> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> >> >> be discardable. At last, they can't use deferred write of kernel
> >> >> so that they could see performance loss.
> >> >> (http://insights.oetiker.ch/linux/fadvise.html)
> >> >
> >> > If rsync use the above url patch, we don't need your patch.
> >> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
> >>
> >> It works well. But it needs always fdatasync before calling fadvise.
> >> For small file, it hurt performance since we can't use the deferred write.
> >
> > I doubt rsync need to call fdatasync. Why?
> >
> > If rsync continue to do following loop, some POSIX_FADV_DONTNEED
> > may not drop some dirty pages. But they can be dropped at next loop's
> > POSIX_FADV_DONTNEED. Then, It doesn't make serious issue.
> >
> > 1) read
> > 2) write
> > 3) POSIX_FADV_DONTNEED
> > 4) goto 1
>
> fadvise need pair (offset and len).
> if the pair in next turn is different with one's previous turn, it
> couldn't be dropped.

invalidate_mapping_pages() are using pagevec_lookup() and pagevec_lookup()
are using radix tree lookup. Then, Even if rsync always use [0, inf) pair, I don't think
it makes much slowdown.

2010-11-23 09:05:24

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 6:02 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Tue, Nov 23, 2010 at 5:01 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> Hi KOSAKI,
>> >>
>> >> 2010/11/23 KOSAKI Motohiro <[email protected]>:
>> >> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>> >> >> But it has a problem. If kernel meets page is writing
>> >> >> during invalidate_mapping_pages, it can't work.
>> >> >> It is very hard for application programmer to use it.
>> >> >> Because they always have to sync data before calling
>> >> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> >> >> be discardable. At last, they can't use deferred write of kernel
>> >> >> so that they could see performance loss.
>> >> >> (http://insights.oetiker.ch/linux/fadvise.html)
>> >> >
>> >> > If rsync use the above url patch, we don't need your patch.
>> >> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
>> >>
>> >> It works well. But it needs always fdatasync before calling fadvise.
>> >> For small file, it hurt performance since we can't use the deferred write.
>> >
>> > I doubt rsync need to call fdatasync. Why?
>> >
>> > If rsync continue to do following loop, some POSIX_FADV_DONTNEED
>> > may not drop some dirty pages. But they can be dropped at next loop's
>> > POSIX_FADV_DONTNEED. Then, It doesn't make serious issue.
>> >
>> > 1) read
>> > 2) write
>> > 3) POSIX_FADV_DONTNEED
>> > 4) goto 1
>>
>> fadvise need pair (offset and len).
>> if the pair in next turn is different with one's previous turn, it
>> couldn't be dropped.
>
> invalidate_mapping_pages() are using pagevec_lookup() and pagevec_lookup()
> are using radix tree lookup. Then, Even if rsync always use [0, inf) pair, I don't think
> it makes much slowdown.
>

I mean fdatasync causes slowdown, not fadvise.
if you fadvise(don't need) without fdatasync, you could loss the data.
--
Kind regards,
Minchan Kim

2010-11-23 09:07:22

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 6:05 PM, Minchan Kim <[email protected]> wrote:
> On Tue, Nov 23, 2010 at 6:02 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> On Tue, Nov 23, 2010 at 5:01 PM, KOSAKI Motohiro
>>> <[email protected]> wrote:
>>> >> Hi KOSAKI,
>>> >>
>>> >> 2010/11/23 KOSAKI Motohiro <[email protected]>:
>>> >> >> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>>> >> >> But it has a problem. If kernel meets page is writing
>>> >> >> during invalidate_mapping_pages, it can't work.
>>> >> >> It is very hard for application programmer to use it.
>>> >> >> Because they always have to sync data before calling
>>> >> >> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>>> >> >> be discardable. At last, they can't use deferred write of kernel
>>> >> >> so that they could see performance loss.
>>> >> >> (http://insights.oetiker.ch/linux/fadvise.html)
>>> >> >
>>> >> > If rsync use the above url patch, we don't need your patch.
>>> >> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
>>> >>
>>> >> It works well. But it needs always fdatasync before calling fadvise.
>>> >> For small file, it hurt performance since we can't use the deferred write.
>>> >
>>> > I doubt rsync need to call fdatasync. Why?
>>> >
>>> > If rsync continue to do following loop, some POSIX_FADV_DONTNEED
>>> > may not drop some dirty pages. But they can be dropped at next loop's
>>> > POSIX_FADV_DONTNEED. Then, It doesn't make serious issue.
>>> >
>>> > 1) read
>>> > 2) write
>>> > 3) POSIX_FADV_DONTNEED
>>> > 4) goto 1
>>>
>>> fadvise need pair (offset and len).
>>> if the pair in next turn is different with one's previous turn, it
>>> couldn't be dropped.
>>
>> invalidate_mapping_pages() are using pagevec_lookup() and pagevec_lookup()
>> are using radix tree lookup. Then, Even if rsync always use [0, inf) pair, I don't think
>> it makes much slowdown.
>>
>
> I mean fdatasync causes slowdown, not fadvise.
> if you fadvise(don't need) without fdatasync, you could loss the data.

Oops sorry. Not lose data.
Frequent fdatasync call doesn't use buffered write so that it causes
slowdown than using deferred write.

> --
> Kind regards,
> Minchan Kim
>

--
Kind regards,
Minchan Kim

2010-11-23 09:28:44

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Sun, Nov 21, 2010 at 11:30:23PM +0900, Minchan Kim wrote:
> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By Other approach, app developer uses POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It is very hard for application programmer to use it.
> Because they always have to sync data before calling
> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> be discardable. At last, they can't use deferred write of kernel
> so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidate is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head.
>
> If it is real working set, it could have a enough time to
> activate the page since we always try to keep many pages in
> inactive list.
>
> I reuse lru_demote of Peter with some change.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
>
> Ben, Remain thing is to modify rsync and use
> fadvise(POSIX_FADV_DONTNEED). Could you test it?
> ---
> include/linux/swap.h | 1 +
> mm/swap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 11 +++++---
> 3 files changed, 69 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..a3c9248 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> extern int lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> +extern void lru_deactive_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/swap.c b/mm/swap.c
> index 3f48542..56fa298 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,8 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
> +
>
> /*
> * This path almost never happens for VM activity - pages are normally
> @@ -266,6 +268,45 @@ void add_page_to_unevictable_list(struct page *page)
> spin_unlock_irq(&zone->lru_lock);
> }
>
> +static void __pagevec_lru_deactive(struct pagevec *pvec)
> +{

Might be worth commenting that this function must be called with pre-emption
disabled. FWIW, I am reasonably sure your implementation is prefectly safe
but a note wouldn't hurt.

> + int i, lru, file;
> +
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> +
> + if (PageLRU(page)) {
> + if (PageActive(page)) {
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page,
> + lru + LRU_ACTIVE);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + __count_vm_event(PGDEACTIVATE);
> +

What about memcg, do we not need to be calling mem_cgroup_add_lru_list() here
as well? I'm looking at the differences between what move_active_pages_to_lru()
is doing and this. I'm wondering if it'd be worth your whole building a list
of active pages that are to be moved to the inactive list and passing them
to move_active_pages_to_lru() ? I confuess I have not thought about it deeply
so it might be a terrible suggestion but it might reduce duplication of code.

> + update_page_reclaim_stat(zone, page, file, 0);
> + }
> + }
> + }
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> +
> + release_pages(pvec->pages, pvec->nr, pvec->cold);
> + pagevec_reinit(pvec);
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
> pagevec_move_tail(pvec);
> local_irq_restore(flags);
> }
> +
> + pvec = &per_cpu(lru_deactive_pvecs, cpu);
> + if (pagevec_count(pvec))
> + __pagevec_lru_deactive(pvec);
> +}
> +
> +/*
> + * Function used to forecefully demote a page to the head of the inactive
> + * list.

s/forecefully/forcefully/

The comment should also state *why* and under what circumstances we move
pages to the inactive list like this. Also based on the discussions
elsewhere in this thread, it'd be nice to include a comment why it's the
head of the inactive list and not the tail.

> + */
> +void lru_deactive_page(struct page *page)
> +{
> + if (likely(get_page_unless_zero(page))) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
> +
> + if (!pagevec_add(pvec, page))
> + __pagevec_lru_deactive(pvec);
> + put_cpu_var(lru_deactive_pvecs);
> + }
> }
>
> +
> void lru_add_drain(void)
> {
> drain_cpu_pagevecs(get_cpu());
> diff --git a/mm/truncate.c b/mm/truncate.c
> index cd94607..c73fb19 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> {
> struct pagevec pvec;
> pgoff_t next = start;
> - unsigned long ret = 0;
> + unsigned long ret;
> + unsigned long count = 0;
> int i;
>
> pagevec_init(&pvec, 0);
> @@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + if (!ret)
> + lru_deactive_page(page);
> + count += ret;
> unlock_page(page);
> if (next > end)
> break;
> @@ -369,7 +372,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> mem_cgroup_uncharge_end();
> cond_resched();
> }
> - return ret;
> + return count;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 09:39:16

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Mon, Nov 22, 2010 at 09:01:32PM -0800, Andrew Morton wrote:
> On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
>
> > >> +/*
> > >> + * Function used to forecefully demote a page to the head of the inactive
> > >> + * list.
> > >> + */
> > >
> > > This comment is wrong? __The page gets moved to the _tail_ of the
> > > inactive list?
> >
> > No. I add it in _head_ of the inactive list intentionally.
> > Why I don't add it to _tail_ is that I don't want to be aggressive.
> > The page might be real working set. So I want to give a chance to
> > activate it again.
>
> Well.. why? The user just tried to toss the page away altogether. If
> the kernel wasn't able to do that immediately, the best it can do is to
> toss the page away asap?
>

I'm just guessing here on the motivation but maybe it is in case FADV_DONENEED
was called on a page in use by another process (via read/write more do than
being mapped). Process A says "I don't need this" but by moving it to the
head of the list we give Process B a chance to reference it and reactivate
without incurring a major fault?

> > If it's not working set, it can be reclaimed easily and it can prevent
> > active page demotion since inactive list size would be big enough for
> > not calling shrink_active_list.
>
> What is "working set"? Mapped and unmapped pagecache, or are you
> referring solely to mapped pagecache?
>
> If it's mapped pagecache then the user was being a bit silly (or didn't
> know that some other process had mapped the file). In which case we
> need to decide what to do - leave the page alone, deactivate it, or
> half-deactivate it as this patch does.
>

What are the odds of an fadvise() user having used mincore() in advance
to determine if the page was in use by another process? I would guess
"low" so this half-deactivate gives a chance for the page to be promoted
again as well as a chance for the flusher threads to clean the page if
it really is to be reclaimed.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 09:44:14

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Mon, Nov 22, 2010 at 09:48:14PM -0800, Andrew Morton wrote:
> On Tue, 23 Nov 2010 14:45:15 +0900 Minchan Kim <[email protected]> wrote:
>
> > On Tue, Nov 23, 2010 at 2:22 PM, Andrew Morton
> > <[email protected]> wrote:
> > > On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:
> > >
> > >> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
> > >> <[email protected]> wrote:
> > >> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
> > >> >
> > >> >> >> +/*
> > >> >> >> + * Function used to forecefully demote a page to the head of the inactive
> > >> >> >> + * list.
> > >> >> >> + */
> > >> >> >
> > >> >> > This comment is wrong? __The page gets moved to the _tail_ of the
> > >> >> > inactive list?
> > >> >>
> > >> >> No. I add it in _head_ of the inactive list intentionally.
> > >> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
> > >> >> The page might be real working set. So I want to give a chance to
> > >> >> activate it again.
> > >> >
> > >> > Well.. __why? __The user just tried to toss the page away altogether. __If
> > >> > the kernel wasn't able to do that immediately, the best it can do is to
> > >> > toss the page away asap?
> > >> >
> > >> >> If it's not working set, it can be reclaimed easily and it can prevent
> > >> >> active page demotion since inactive list size would be big enough for
> > >> >> not calling shrink_active_list.
> > >> >
> > >> > What is "working set"? __Mapped and unmapped pagecache, or are you
> > >> > referring solely to mapped pagecache?
> > >>
> > >> I mean it's mapped by other processes.
> > >>
> > >> >
> > >> > If it's mapped pagecache then the user was being a bit silly (or didn't
> > >> > know that some other process had mapped the file). __In which case we
> > >> > need to decide what to do - leave the page alone, deactivate it, or
> > >> > half-deactivate it as this patch does.
> > >>
> > >>
> > >> What I want is the half-deactivate.
> > >>
> > >> Okay. We will use the result of invalidate_inode_page.
> > >> If fail happens by page_mapped, we can do half-deactivate.
> > >> But if fail happens by dirty(ex, writeback), we can add it to tail.
> > >> Does it make sense?
> > >
> > > Spose so. __It's unobvious.
> > >
> > > If the page is dirty or under writeback then reclaim will immediately
> > > move it to the head of the LRU anyway. __But given that the user has
> >
> > Why does it move into head of LRU?
> > If the page which isn't mapped doesn't have PG_referenced, it would be
> > reclaimed.
>
> If it's dirty or under writeback it can't be reclaimed!
>
> > > just freed a bunch of pages with invalidate(), it's unlikely that
> > > reclaim will be running soon.
> >
> > If reclaim doesn't start soon, it's good. That's because we have a
> > time to activate it and
> > when reclaim happens, reclaimer can reclaim pages easily.
> >
> > If I don't understand your point, could you elaborate on it?
>
> If reclaim doesn't happen soon and the page was dirty or under
> writeback (and hence unreclaimable) then there's a better chance that
> it _will_ be reclaimable by the time reclaim comes along and has a look
> at it. Yes, that's good.
>
> And a note to Mel: this is one way in which we can get significant
> (perhaps tremendous) numbers of dirty pages coming off the tail of the
> LRU, and hence eligible for pageout() treatment.
>

Agreed. This is why I'd be ok with the pages always being added to the head
of the inactive list to take into consideration another process might be
using the page (via read/write so it's not very obvious) and to give flusher
threads a chance to clean the page. Ideally the pages being deactivated would
be prioritised for cleaning but I don't think we have a mechanism for it at
the moment.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 09:51:10

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Mon, Nov 22, 2010 at 02:21:09PM -0800, Andrew Morton wrote:
> On Sun, 21 Nov 2010 23:30:24 +0900
> Minchan Kim <[email protected]> wrote:
>
> > Now zap_pte_range alwayas promotes pages which are pte_young &&
> > !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
> > it's unnecessary since the page wouldn't use any more.
> >
> > If the page is sharred by other processes and it's real working set
>
> This patch doesn't actually do anything. It passes variable `promote'
> all the way down to unmap_vmas(), but unmap_vmas() doesn't use that new
> variable.
>
> Have a comment fixlet:
>
> --- a/mm/memory.c~mm-prevent-promotion-of-page-in-madvise_dontneed-fix
> +++ a/mm/memory.c
> @@ -1075,7 +1075,7 @@ static unsigned long unmap_page_range(st
> * @end_addr: virtual address at which to end unmapping
> * @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
> * @details: details of nonlinear truncation or shared cache invalidation
> - * @promote: whether pages inclued vma would be promoted or not
> + * @promote: whether pages included in the vma should be promoted or not
> *
> * Returns the end address of the unmapping (restart addr if interrupted).
> *
> _
>
> Also, I'd suggest that we avoid introducing the term "promote".

Promote also has special meaning for huge pages. Demoting or promoting a
page refers to changing its size. The same applies to the other patch -
s/demote/deactive/ s/promote/activate/ . Currently this is no confusion
within the VM but when Andrea's THP patches are merged, it'll become an
issue.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 13:55:38

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 16:16:55 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> > On Sun, 21 Nov 2010 23:30:23 +0900, Minchan Kim <[email protected]> wrote:
> > >
> > > Ben, Remain thing is to modify rsync and use
> > > fadvise(POSIX_FADV_DONTNEED). Could you test it?
> >
> > Thanks a ton for the patch. Looks good. Testing as we speak.
>
For the record, this was a little premature. As I spoke the kernel was
building but I still haven't had a chance to take any data. Any
suggestions for how to determine the effect (or hopefully lack thereof)
of rsync on the system's working set?

> If possible, can you please post your rsync patch and your testcase
> (or your rsync option + system memory size info + data size info)?
>
Patch coming right up.

The original test case is a backup script for my home directory. rsync
is invoked with,

rsync --archive --update --progress --delete --delete-excluded
--exclude-from=~/.backup/exclude --log-file=~/.backup/rsync.log -e ssh
/home/ben ben@myserver:/mnt/backup/current

My home directory is 120 GB with typical delta sizes of tens of
megabytes between backups (although sometimes deltas can be gigabytes,
after which the server has severe interactivity issues). The server is
unfortunately quite memory constrained with only 1.5GB of memory (old
inherited hardware). Given the size of my typical deltas, I'm worried
that even simply walking the directory hierarchy might be enough to push
out my working set.

Looking at the rsync access pattern with strace it seems that it does
a very good job of avoid duplicate reads which is good news for these
patches.

Cheers,

- Ben

2010-11-23 14:50:23

[permalink] [raw]

Subject: [PATCH 1/3] Add fadvise interface wrapper

With recent discussion on the LKML[1], it seems likely that Linux will
finally support posix_fadvise in a useful way with the FADV_DONTNEED
flag. This should allow us to minimize the effect of rsync on the
system's working set. Add the necessary wrapper to syscall.c.

[1] http://lkml.org/lkml/2010/11/21/59
---
syscall.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/syscall.c b/syscall.c
index cfabc3e..9f5b1c3 100644
--- a/syscall.c
+++ b/syscall.c
@@ -28,6 +28,7 @@
#ifdef HAVE_SYS_ATTR_H
#include <sys/attr.h>
#endif
+#include <fcntl.h>

extern int dry_run;
extern int am_root;
@@ -282,3 +283,13 @@ OFF_T do_lseek(int fd, OFF_T offset, int whence)
return lseek(fd, offset, whence);
#endif
}
+
+#if _XOPEN_SOURCE >= 600
+int do_fadvise(int fd, OFF_T offset, OFF_T len, int advise)
+{
+ return posix_fadvise(fd, offset, len, advise);
+}
+#else
+#define do_fadvise()
+#endif
+
--
1.7.1

2010-11-23 14:50:28

[permalink] [raw]

Subject: [PATCH 3/3] Inform kernel of FADV_DONTNEED hint in receiver

Use the FADV_DONTNEED fadvise hint after finishing writing to a
destinataion fd in the receiver.
---
receiver.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/receiver.c b/receiver.c
index 39c5e49..33b21fb 100644
--- a/receiver.c
+++ b/receiver.c
@@ -721,6 +721,12 @@ int recv_files(int f_in, char *local_name)
recv_ok = receive_data(f_in, fnamecmp, fd1, st.st_size,
fname, fd2, F_LENGTH(file));

+ if (do_fadvise(fd2, 0, 0, POSIX_FADV_DONTNEED) != 0) {
+ rsyserr(FERROR_XFER, errno,
+ "fadvise failed in writing %s",
+ full_fname(fname));
+ }
+
log_item(log_code, file, &initial_stats, iflags, NULL);

if (fd1 != -1)
--
1.7.1

2010-11-23 14:50:48

[permalink] [raw]

Subject: [PATCH 2/3] Inform kernel of FADV_DONTNEED hint in sender

Use the FADV_DONTNEED fadvise hint after finishing reading an origin fd
in the sender.
---
sender.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/sender.c b/sender.c
index 59dae7d..a934bfe 100644
--- a/sender.c
+++ b/sender.c
@@ -338,6 +338,12 @@ void send_files(int f_in, int f_out)
if (do_progress)
end_progress(st.st_size);

+ if (do_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0) {
+ rsyserr(FERROR_XFER, errno,
+ "fadvise failed in sending %s",
+ full_fname(fname));
+ }
+
log_item(log_code, file, &initial_stats, iflags, NULL);

if (mbuf) {
--
1.7.1

2010-11-23 14:50:18

[permalink] [raw]

Subject: [RFC PATCH] fadvise support in rsync

Warning for kernel folks: I'm not much of an mm person; let me know if I got
anything horribly wrong.

Many folks use rsync in their nightly backup jobs. In these applications, speed
is of minimal concern and should be sacrificed in order to minimize the effect
of rsync on the rest of the machine. When rsync is working on a large directory
it can quickly fill the page cache with written data, displacing the rest of
the system's working set. The solution for this is to inform the kernel that
our written pages will no longer be needed and should be expelled to disk. The
POSIX interface for this, posix_fadvise, has existed for some time, but there
has been no useable implementation in any of the major operating systems.

Attempts have been made in the past[1] to use the fadvise interface, but kernel
limitations have made this quite messy. In particular, the kernel supports
FADV_DONTNEED as a single-shot hint; i.e. if the page is clean when the hint is
given it will be freed, otherwise the hint is ignored. For this reason it is
necessary to fdatasync() against dirtied pages before giving the hint. This,
however, requires that rsync do some accounting, calling fdatasync() and
fadvise() only after giving the kernel an opportunity to flush the data itself.

Moreover, fadvise(DONTNEED) frees pages regardless of whether the hinting
process is the only referrer. For this reason, the previous fadvise patch also
used mincore to identify which pages are needed by other processes. Altogether,
this makes using fadvise very expensive from a complexity standpoint. This is
very unfortunately since the interface could be quite usable with a few minor
changes.

I recently asked about this on the LKML[2], where Minchan Kim was nice enough
to put together a patch improving support for the FADV_DONTNEED hint. His patch
adds invalidated flagged pages to the inactive list. This obviates the need for
fdatasync() since the page will be reclaimed by the kernel in the standard
inactive reclaim path. Moreover, by adding hinted pages to the head of the
inactive list, other processes are given ample time to call the pages back to
the active list, eliminating the need for the previous mincore() hack.

Here is my attempt at adding fadvise support to rsync (against v3.0.7). I do
this in both the sender (hinting after match_sums()) and the receiver (hinting
after receive_data()). In principle we could get better granularity if this was
hooked up within match_sums() (or even the map_ptr() interface) and the receive
loop in receive_data(), but I wanted to keep things simple at first (any
comments on these ideas?) . At the moment is for little more than testing.
Considering the potential negative effects of using FADV_DONTNEED on older
kernels, it is likely we will want this functionality off by default with a
command line flag to enable.

Cheers,

- Ben

[1] http://insights.oetiker.ch/linux/fadvise.html
[2] http://lkml.org/lkml/2010/11/21/59

2010-11-23 14:55:55

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 09:38:59 +0000, Mel Gorman <[email protected]> wrote:
> > If it's mapped pagecache then the user was being a bit silly (or didn't
> > know that some other process had mapped the file). In which case we
> > need to decide what to do - leave the page alone, deactivate it, or
> > half-deactivate it as this patch does.
> >
>
> What are the odds of an fadvise() user having used mincore() in advance
> to determine if the page was in use by another process? I would guess
> "low" so this half-deactivate gives a chance for the page to be promoted
> again as well as a chance for the flusher threads to clean the page if
> it really is to be reclaimed.
>
Do we really want to make the user jump through such hoops as using
mincore() just to get the kernel to handle use-once pages properly?
I hope the answer is no. I know that fadvise isn't supposed to be a
magic bullet, but it would be nice if more processes would use it to
indicate their access patterns and the only way that will happen is if
it is reasonably straightforward to use.

- Ben

2010-11-23 14:57:41

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 16:16:55 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> > By Other approach, app developer uses POSIX_FADV_DONTNEED.
> > But it has a problem. If kernel meets page is writing
> > during invalidate_mapping_pages, it can't work.
> > It is very hard for application programmer to use it.
> > Because they always have to sync data before calling
> > fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> > be discardable. At last, they can't use deferred write of kernel
> > so that they could see performance loss.
> > (http://insights.oetiker.ch/linux/fadvise.html)
>
> If rsync use the above url patch, we don't need your patch.
> fdatasync() + POSIX_FADV_DONTNEED should work fine.
>
This is quite true, but the patch itself is fairly invasive and
unnecessarily so which makes it unsuitable for merging in the eyes of
the rsync maintainers (not that I can blame them). This is by no fault
of its author; using fadvise is just far harder than it should be.

- Ben

2010-11-23 14:59:13

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 09:55:49AM -0500, Ben Gamari wrote:
> On Tue, 23 Nov 2010 09:38:59 +0000, Mel Gorman <[email protected]> wrote:
> > > If it's mapped pagecache then the user was being a bit silly (or didn't
> > > know that some other process had mapped the file). In which case we
> > > need to decide what to do - leave the page alone, deactivate it, or
> > > half-deactivate it as this patch does.
> > >
> >
> > What are the odds of an fadvise() user having used mincore() in advance
> > to determine if the page was in use by another process? I would guess
> > "low" so this half-deactivate gives a chance for the page to be promoted
> > again as well as a chance for the flusher threads to clean the page if
> > it really is to be reclaimed.
> >
> Do we really want to make the user jump through such hoops as using
> mincore() just to get the kernel to handle use-once pages properly?

I would think "no" which is why I support half-deactivating pages so they won't
have to. The downside is that it's essentially a race window as another process
needs to reactivate the page before it gets reclaimed to avoid a major fault.

> I hope the answer is no. I know that fadvise isn't supposed to be a
> magic bullet, but it would be nice if more processes would use it to
> indicate their access patterns and the only way that will happen is if
> it is reasonably straightforward to use.
>
> - Ben
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 15:36:45

by Pádraig Brady

[permalink] [raw]

Subject: Re: [RFC PATCH] fadvise support in rsync

On 23/11/10 14:49, Ben Gamari wrote:
> Warning for kernel folks: I'm not much of an mm person; let me know if I got
> anything horribly wrong.
>
> Many folks use rsync in their nightly backup jobs. In these applications, speed
> is of minimal concern and should be sacrificed in order to minimize the effect
> of rsync on the rest of the machine. When rsync is working on a large directory
> it can quickly fill the page cache with written data, displacing the rest of
> the system's working set. The solution for this is to inform the kernel that
> our written pages will no longer be needed and should be expelled to disk. The
> POSIX interface for this, posix_fadvise, has existed for some time, but there
> has been no useable implementation in any of the major operating systems.
>
> Attempts have been made in the past[1] to use the fadvise interface, but kernel
> limitations have made this quite messy. In particular, the kernel supports
> FADV_DONTNEED as a single-shot hint; i.e. if the page is clean when the hint is
> given it will be freed, otherwise the hint is ignored. For this reason it is
> necessary to fdatasync() against dirtied pages before giving the hint. This,
> however, requires that rsync do some accounting, calling fdatasync() and
> fadvise() only after giving the kernel an opportunity to flush the data itself.
>
> Moreover, fadvise(DONTNEED) frees pages regardless of whether the hinting
> process is the only referrer. For this reason, the previous fadvise patch also
> used mincore to identify which pages are needed by other processes. Altogether,
> this makes using fadvise very expensive from a complexity standpoint. This is
> very unfortunately since the interface could be quite usable with a few minor
> changes.

Thanks for the easy to understand summary,
and for working through a solution.

> I recently asked about this on the LKML[2], where Minchan Kim was nice enough
> to put together a patch improving support for the FADV_DONTNEED hint. His patch
> adds invalidated flagged pages to the inactive list. This obviates the need for
> fdatasync() since the page will be reclaimed by the kernel in the standard
> inactive reclaim path. Moreover, by adding hinted pages to the head of the
> inactive list, other processes are given ample time to call the pages back to
> the active list, eliminating the need for the previous mincore() hack.
>
> Here is my attempt at adding fadvise support to rsync (against v3.0.7). I do
> this in both the sender (hinting after match_sums()) and the receiver (hinting
> after receive_data()).

So for the moment you still thrash the cache when
backing up large files. Fair enough.

> In principle we could get better granularity if this was
> hooked up within match_sums() (or even the map_ptr() interface) and the receive
> loop in receive_data(), but I wanted to keep things simple at first (any
> comments on these ideas?) .

I implemented finer grained fadvise(DONTNEED) in dvd-vr?
and noted that I had to be careful not to invalidate
read ahead data from the cache. I.E. I had to specify
the specific range I had processed.

> At the moment is for little more than testing.
> Considering the potential negative effects of using FADV_DONTNEED on older
> kernels, it is likely we will want this functionality off by default with a
> command line flag to enable.

The downside being, files opened by other apps may be dropped?
Well if you're thrashing the cache anyway I'd just enable by
default and get the good behavior on new kernels.

cheers,
P?draig.

[1] http://www.pixelbeat.org/programs/dvd-vr/dvd-vr-0.9.7.tar.gz

2010-11-23 20:36:33

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, 23 Nov 2010 14:58:56 +0000
Mel Gorman <[email protected]> wrote:

> On Tue, Nov 23, 2010 at 09:55:49AM -0500, Ben Gamari wrote:
> > On Tue, 23 Nov 2010 09:38:59 +0000, Mel Gorman <[email protected]> wrote:
> > > > If it's mapped pagecache then the user was being a bit silly (or didn't
> > > > know that some other process had mapped the file). In which case we
> > > > need to decide what to do - leave the page alone, deactivate it, or
> > > > half-deactivate it as this patch does.
> > > >
> > >
> > > What are the odds of an fadvise() user having used mincore() in advance
> > > to determine if the page was in use by another process? I would guess
> > > "low" so this half-deactivate gives a chance for the page to be promoted
> > > again as well as a chance for the flusher threads to clean the page if
> > > it really is to be reclaimed.
> > >
> > Do we really want to make the user jump through such hoops as using
> > mincore() just to get the kernel to handle use-once pages properly?
>
> I would think "no" which is why I support half-deactivating pages so they won't
> have to.

If the page is page_mapped() then we can assume that some other process
is using it and we leave it alone *altogether*.

If the page is dirty or under writeback (and !page_mapped()) then we
should assume that we should free it asap. The PageReclaim() trick
might help with that.

I just don't see any argument for moving the page to the head of the
inactive LRU as a matter of policy. We can park it there because we
can't think of anythnig else to do with it, but it's the wrong place
for it.

2010-11-23 22:11:07

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 12:35:35PM -0800, Andrew Morton wrote:
> On Tue, 23 Nov 2010 14:58:56 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Tue, Nov 23, 2010 at 09:55:49AM -0500, Ben Gamari wrote:
> > > On Tue, 23 Nov 2010 09:38:59 +0000, Mel Gorman <[email protected]> wrote:
> > > > > If it's mapped pagecache then the user was being a bit silly (or didn't
> > > > > know that some other process had mapped the file). In which case we
> > > > > need to decide what to do - leave the page alone, deactivate it, or
> > > > > half-deactivate it as this patch does.
> > > > >
> > > >
> > > > What are the odds of an fadvise() user having used mincore() in advance
> > > > to determine if the page was in use by another process? I would guess
> > > > "low" so this half-deactivate gives a chance for the page to be promoted
> > > > again as well as a chance for the flusher threads to clean the page if
> > > > it really is to be reclaimed.
> > > >
> > > Do we really want to make the user jump through such hoops as using
> > > mincore() just to get the kernel to handle use-once pages properly?
> >
> > I would think "no" which is why I support half-deactivating pages so they won't
> > have to.
>
> If the page is page_mapped() then we can assume that some other process
> is using it and we leave it alone *altogether*.
>

Agreed, that makes perfect sense.

> If the page is dirty or under writeback (and !page_mapped()) then we
> should assume that we should free it asap. The PageReclaim() trick
> might help with that.
>

Again agreed.

> I just don't see any argument for moving the page to the head of the
> inactive LRU as a matter of policy. We can park it there because we
> can't think of anythnig else to do with it, but it's the wrong place
> for it.
>

Is there a better alternative? One thing that springs to mind is that we are
not exactly tracking very well what effect these policy changes have. The
analysis scripts I have do a reasonable job on tracking reclaim activity
(although only as part of the mmtests tarball, I should split them out as
a standalone tool) but not the impact - namely minor and major faults. I
should sort that out so we can put better reclaim analysis in place.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-23 23:24:39

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

Hi Mel,

On Tue, Nov 23, 2010 at 6:28 PM, Mel Gorman <[email protected]> wrote:
> On Sun, Nov 21, 2010 at 11:30:23PM +0900, Minchan Kim wrote:
>> Recently, there are reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>> active list so that it results in working set page eviction.
>>
>> Some app developer want to support POSIX_FADV_NOREUSE.
>> But other OSes don't support it, either.
>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>
>> By Other approach, app developer uses POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It is very hard for application programmer to use it.
>> Because they always have to sync data before calling
>> fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
>> be discardable. At last, they can't use deferred write of kernel
>> so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>>
>> In fact, invalidate is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head.
>>
>> If it is real working set, it could have a enough time to
>> activate the page since we always try to keep many pages in
>> inactive list.
>>
>> I reuse lru_demote of Peter with some change.
>>
>> Reported-by: Ben Gamari <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>>
>> Ben, Remain thing is to modify rsync and use
>> fadvise(POSIX_FADV_DONTNEED). Could you test it?
>> ---
>> ?include/linux/swap.h | ? ?1 +
>> ?mm/swap.c ? ? ? ? ? ?| ? 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> ?mm/truncate.c ? ? ? ?| ? 11 +++++---
>> ?3 files changed, 69 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..a3c9248 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
>> ?extern void lru_add_drain(void);
>> ?extern int lru_add_drain_all(void);
>> ?extern void rotate_reclaimable_page(struct page *page);
>> +extern void lru_deactive_page(struct page *page);
>> ?extern void swap_setup(void);
>>
>> ?extern void add_page_to_unevictable_list(struct page *page);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 3f48542..56fa298 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -39,6 +39,8 @@ int page_cluster;
>>
>> ?static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>> ?static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>> +static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
>> +
>>
>> ?/*
>> ? * This path almost never happens for VM activity - pages are normally
>> @@ -266,6 +268,45 @@ void add_page_to_unevictable_list(struct page *page)
>> ? ? ? spin_unlock_irq(&zone->lru_lock);
>> ?}
>>
>> +static void __pagevec_lru_deactive(struct pagevec *pvec)
>> +{
>
> Might be worth commenting that this function must be called with pre-emption
> disabled. FWIW, I am reasonably sure your implementation is prefectly safe
> but a note wouldn't hurt.

Will fix.

>
>> + ? ? int i, lru, file;
>> +
>> + ? ? struct zone *zone = NULL;
>> +
>> + ? ? for (i = 0; i < pagevec_count(pvec); i++) {
>> + ? ? ? ? ? ? struct page *page = pvec->pages[i];
>> + ? ? ? ? ? ? struct zone *pagezone = page_zone(page);
>> +
>> + ? ? ? ? ? ? if (pagezone != zone) {
>> + ? ? ? ? ? ? ? ? ? ? if (zone)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
>> + ? ? ? ? ? ? ? ? ? ? zone = pagezone;
>> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&zone->lru_lock);
>> + ? ? ? ? ? ? }
>> +
>> + ? ? ? ? ? ? if (PageLRU(page)) {
>> + ? ? ? ? ? ? ? ? ? ? if (PageActive(page)) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? file = page_is_file_cache(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru = page_lru_base_type(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? del_page_from_lru_list(zone, page,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru + LRU_ACTIVE);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageActive(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageReferenced(page);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? add_page_to_lru_list(zone, page, lru);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? __count_vm_event(PGDEACTIVATE);
>> +
>
> What about memcg, do we not need to be calling mem_cgroup_add_lru_list() here
> as well? I'm looking at the differences between what move_active_pages_to_lru()

Recently, add_page_to_lru_list contains mem_cgroup_add_lru_list.

> is doing and this. I'm wondering if it'd be worth your whole building a list
> of active pages that are to be moved to the inactive list and passing them
> to move_active_pages_to_lru() ? I confuess I have not thought about it deeply
> so it might be a terrible suggestion but it might reduce duplication of code.

Firstly I tried it so I sent a patch about making
move_to_active_pages_to_lru more generic.
move_to_active_pages_to_lru needs zone argument so I need gathering
pages per zone in truncate.
I don't want for user of the function to consider even zone and
zone->lru_lock handling.

I think the lru_demote_pages could be used elsewhere(ex, readahead max
size heuristic).
So it's generic and easy to use. :)

>
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? update_page_reclaim_stat(zone, page, file, 0);
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? }
>> + ? ? }
>> + ? ? if (zone)
>> + ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
>> +
>> + ? ? release_pages(pvec->pages, pvec->nr, pvec->cold);
>> + ? ? pagevec_reinit(pvec);
>> +}
>> +
>> ?/*
>> ? * Drain pages out of the cpu's pagevecs.
>> ? * Either "cpu" is the current CPU, and preemption has already been
>> @@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
>> ? ? ? ? ? ? ? pagevec_move_tail(pvec);
>> ? ? ? ? ? ? ? local_irq_restore(flags);
>> ? ? ? }
>> +
>> + ? ? pvec = &per_cpu(lru_deactive_pvecs, cpu);
>> + ? ? if (pagevec_count(pvec))
>> + ? ? ? ? ? ? __pagevec_lru_deactive(pvec);
>> +}
>> +
>> +/*
>> + * Function used to forecefully demote a page to the head of the inactive
>> + * list.
>
> s/forecefully/forcefully/
>
> The comment should also state *why* and under what circumstances we move
> pages to the inactive list like this. Also based on the discussions
> elsewhere in this thread, it'd be nice to include a comment why it's the
> head of the inactive list and not the tail.

Fair enough.

Thanks for the comment, Mel.

--
Kind regards,
Minchan Kim

2010-11-23 23:32:29

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 6:43 PM, Mel Gorman <[email protected]> wrote:
> On Mon, Nov 22, 2010 at 09:48:14PM -0800, Andrew Morton wrote:
>> On Tue, 23 Nov 2010 14:45:15 +0900 Minchan Kim <[email protected]> wrote:
>>
>> > On Tue, Nov 23, 2010 at 2:22 PM, Andrew Morton
>> > <[email protected]> wrote:
>> > > On Tue, 23 Nov 2010 14:23:33 +0900 Minchan Kim <[email protected]> wrote:
>> > >
>> > >> On Tue, Nov 23, 2010 at 2:01 PM, Andrew Morton
>> > >> <[email protected]> wrote:
>> > >> > On Tue, 23 Nov 2010 13:52:05 +0900 Minchan Kim <[email protected]> wrote:
>> > >> >
>> > >> >> >> +/*
>> > >> >> >> + * Function used to forecefully demote a page to the head of the inactive
>> > >> >> >> + * list.
>> > >> >> >> + */
>> > >> >> >
>> > >> >> > This comment is wrong? __The page gets moved to the _tail_ of the
>> > >> >> > inactive list?
>> > >> >>
>> > >> >> No. I add it in _head_ of the inactive list intentionally.
>> > >> >> Why I don't add it to _tail_ is that I don't want to be aggressive.
>> > >> >> The page might be real working set. So I want to give a chance to
>> > >> >> activate it again.
>> > >> >
>> > >> > Well.. __why? __The user just tried to toss the page away altogether. __If
>> > >> > the kernel wasn't able to do that immediately, the best it can do is to
>> > >> > toss the page away asap?
>> > >> >
>> > >> >> If it's not working set, it can be reclaimed easily and it can prevent
>> > >> >> active page demotion since inactive list size would be big enough for
>> > >> >> not calling shrink_active_list.
>> > >> >
>> > >> > What is "working set"? __Mapped and unmapped pagecache, or are you
>> > >> > referring solely to mapped pagecache?
>> > >>
>> > >> I mean it's mapped by other processes.
>> > >>
>> > >> >
>> > >> > If it's mapped pagecache then the user was being a bit silly (or didn't
>> > >> > know that some other process had mapped the file). __In which case we
>> > >> > need to decide what to do - leave the page alone, deactivate it, or
>> > >> > half-deactivate it as this patch does.
>> > >>
>> > >>
>> > >> What I want is the half-deactivate.
>> > >>
>> > >> Okay. We will use the result of invalidate_inode_page.
>> > >> If fail happens by page_mapped, we can do half-deactivate.
>> > >> But if fail happens by dirty(ex, writeback), we can add it to tail.
>> > >> Does it make sense?
>> > >
>> > > Spose so. __It's unobvious.
>> > >
>> > > If the page is dirty or under writeback then reclaim will immediately
>> > > move it to the head of the LRU anyway. __But given that the user has
>> >
>> > Why does it move into head of LRU?
>> > If the page which isn't mapped doesn't have PG_referenced, it would be
>> > reclaimed.
>>
>> If it's dirty or under writeback it can't be reclaimed!
>>
>> > > just freed a bunch of pages with invalidate(), it's unlikely that
>> > > reclaim will be running soon.
>> >
>> > If reclaim doesn't start soon, it's good. That's because we have a
>> > time to activate it and
>> > when reclaim happens, reclaimer can reclaim pages easily.
>> >
>> > If I don't understand your point, could you elaborate on it?
>>
>> If reclaim doesn't happen soon and the page was dirty or under
>> writeback (and hence unreclaimable) then there's a better chance that
>> it _will_ be reclaimable by the time reclaim comes along and has a look
>> at it. ?Yes, that's good.
>>
>> And a note to Mel: this is one way in which we can get significant
>> (perhaps tremendous) numbers of dirty pages coming off the tail of the
>> LRU, and hence eligible for pageout() treatment.
>>
>
> Agreed. This is why I'd be ok with the pages always being added to the head
> of the inactive list to take into consideration another process might be
> using the page (via read/write so it's not very obvious) and to give flusher
> threads a chance to clean the page. Ideally the pages being deactivated would
> be prioritised for cleaning but I don't think we have a mechanism for it at
> the moment.

In view point of reclaim, it's right. but might not for efficient write.
If we have a such mechanism, current pageout effect(random write)
could happen still.
I think what we can do is to prevent writeout by pageout.
For it, it's good to put the page head of inacitve and as soon as
write complete, we move the page to tail of inactive.

>
> --
> Mel Gorman
> Part-time Phd Student ? ? ? ? ? ? ? ? ? ? ? ? ?Linux Technology Center
> University of Limerick ? ? ? ? ? ? ? ? ? ? ? ? IBM Dublin Software Lab
>

--
Kind regards,
Minchan Kim

2010-11-23 23:45:22

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Wed, Nov 24, 2010 at 7:10 AM, Mel Gorman <[email protected]> wrote:
> On Tue, Nov 23, 2010 at 12:35:35PM -0800, Andrew Morton wrote:
>> On Tue, 23 Nov 2010 14:58:56 +0000
>> Mel Gorman <[email protected]> wrote:
>>
>> > On Tue, Nov 23, 2010 at 09:55:49AM -0500, Ben Gamari wrote:
>> > > On Tue, 23 Nov 2010 09:38:59 +0000, Mel Gorman <[email protected]> wrote:
>> > > > > If it's mapped pagecache then the user was being a bit silly (or didn't
>> > > > > know that some other process had mapped the file). ?In which case we
>> > > > > need to decide what to do - leave the page alone, deactivate it, or
>> > > > > half-deactivate it as this patch does.
>> > > > >
>> > > >
>> > > > What are the odds of an fadvise() user having used mincore() in advance
>> > > > to determine if the page was in use by another process? I would guess
>> > > > "low" so this half-deactivate gives a chance for the page to be promoted
>> > > > again as well as a chance for the flusher threads to clean the page if
>> > > > it really is to be reclaimed.
>> > > >
>> > > Do we really want to make the user jump through such hoops as using
>> > > mincore() just to get the kernel to handle use-once pages properly?
>> >
>> > I would think "no" which is why I support half-deactivating pages so they won't
>> > have to.
>>
>> If the page is page_mapped() then we can assume that some other process
>> is using it and we leave it alone *altogether*.
>>
>
> Agreed, that makes perfect sense.
>
>> If the page is dirty or under writeback (and !page_mapped()) then we
>> should assume that we should free it asap. ?The PageReclaim() trick
>> might help with that.
>>
>
> Again agreed.
>
>> I just don't see any argument for moving the page to the head of the
>> inactive LRU as a matter of policy. ?We can park it there because we
>> can't think of anythnig else to do with it, but it's the wrong place
>> for it.
>>
>
> Is there a better alternative? One thing that springs to mind is that we are
> not exactly tracking very well what effect these policy changes have. The
> analysis scripts I have do a reasonable job on tracking reclaim activity
> (although only as part of the mmtests tarball, I should split them out as
> a standalone tool) but not the impact - namely minor and major faults. I
> should sort that out so we can put better reclaim analysis in place.

It can help very much. :)

Also, I need time since I am so busy.

>
> --
> Mel Gorman
> Part-time Phd Student ? ? ? ? ? ? ? ? ? ? ? ? ?Linux Technology Center
> University of Limerick ? ? ? ? ? ? ? ? ? ? ? ? IBM Dublin Software Lab
>

--
Kind regards,
Minchan Kim

2010-11-23 23:48:51

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Tue, Nov 23, 2010 at 10:48 PM, Ben Gamari <[email protected]> wrote:
> On Tue, 23 Nov 2010 16:16:55 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
>> > On Sun, 21 Nov 2010 23:30:23 +0900, Minchan Kim <[email protected]> wrote:
>> > >
>> > > Ben, Remain thing is to modify rsync and use
>> > > fadvise(POSIX_FADV_DONTNEED). Could you test it?
>> >
>> > Thanks a ton for the patch. Looks good. Testing as we speak.
>>
> For the record, this was a little premature. As I spoke the kernel was
> building but I still haven't had a chance to take any data. Any
> suggestions for how to determine the effect (or hopefully lack thereof)
> of rsync on the system's working set?
>
>> If possible, can you please post your rsync patch and your testcase
>> (or your rsync option + system memory size info + data size info)?
>>
> Patch coming right up.
>
> The original test case is a backup script for my home directory. rsync
> is invoked with,
>
> rsync --archive --update --progress --delete --delete-excluded
> --exclude-from=~/.backup/exclude --log-file=~/.backup/rsync.log -e ssh
> /home/ben ben@myserver:/mnt/backup/current
>
> My home directory is 120 GB with typical delta sizes of tens of
> megabytes between backups (although sometimes deltas can be gigabytes,
> after which the server has severe interactivity issues). The server is
> unfortunately quite memory constrained with only 1.5GB of memory (old
> inherited hardware). Given the size of my typical deltas, I'm worried
> that even simply walking the directory hierarchy might be enough to push
> out my working set.
>
> Looking at the rsync access pattern with strace it seems that it does
> a very good job of avoid duplicate reads which is good news for these
> patches.

Thanks for the notice. Ben.
FYI, we have a plan to change the policy as you look this thread.
Maybe It would be good than my current policy in the page.

Please recognize it. :)

>
> Cheers,
>
> - Ben
>
>
>

--
Kind regards,
Minchan Kim

2010-11-23 23:49:52

[permalink] [raw]

Subject: Re: [RFC 2/2] Prevent promotion of page in madvise_dontneed

On Tue, Nov 23, 2010 at 6:50 PM, Mel Gorman <[email protected]> wrote:
> On Mon, Nov 22, 2010 at 02:21:09PM -0800, Andrew Morton wrote:
>> On Sun, 21 Nov 2010 23:30:24 +0900
>> Minchan Kim <[email protected]> wrote:
>>
>> > Now zap_pte_range alwayas promotes pages which are pte_young &&
>> > !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
>> > it's unnecessary since the page wouldn't use any more.
>> >
>> > If the page is sharred by other processes and it's real working set
>>
>> This patch doesn't actually do anything. ?It passes variable `promote'
>> all the way down to unmap_vmas(), but unmap_vmas() doesn't use that new
>> variable.
>>
>> Have a comment fixlet:
>>
>> --- a/mm/memory.c~mm-prevent-promotion-of-page-in-madvise_dontneed-fix
>> +++ a/mm/memory.c
>> @@ -1075,7 +1075,7 @@ static unsigned long unmap_page_range(st
>> ? * @end_addr: virtual address at which to end unmapping
>> ? * @nr_accounted: Place number of unmapped pages in vm-accountable vma's here
>> ? * @details: details of nonlinear truncation or shared cache invalidation
>> - * @promote: whether pages inclued vma would be promoted or not
>> + * @promote: whether pages included in the vma should be promoted or not
>> ? *
>> ? * Returns the end address of the unmapping (restart addr if interrupted).
>> ? *
>> _
>>
>> Also, I'd suggest that we avoid introducing the term "promote".
>
> Promote also has special meaning for huge pages. Demoting or promoting a
> page refers to changing its size. The same applies to the other patch -
> s/demote/deactive/ s/promote/activate/ . Currently this is no confusion
> within the VM but when Andrea's THP patches are merged, it'll become an
> issue.

Thanks for the information.

>
> --
> Mel Gorman
> Part-time Phd Student ? ? ? ? ? ? ? ? ? ? ? ? ?Linux Technology Center
> University of Limerick ? ? ? ? ? ? ? ? ? ? ? ? IBM Dublin Software Lab
>

--
Kind regards,
Minchan Kim

2010-11-24 00:13:57

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

> On Tue, 23 Nov 2010 16:16:55 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> > > By Other approach, app developer uses POSIX_FADV_DONTNEED.
> > > But it has a problem. If kernel meets page is writing
> > > during invalidate_mapping_pages, it can't work.
> > > It is very hard for application programmer to use it.
> > > Because they always have to sync data before calling
> > > fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
> > > be discardable. At last, they can't use deferred write of kernel
> > > so that they could see performance loss.
> > > (http://insights.oetiker.ch/linux/fadvise.html)
> >
> > If rsync use the above url patch, we don't need your patch.
> > fdatasync() + POSIX_FADV_DONTNEED should work fine.
> >
> This is quite true, but the patch itself is fairly invasive and
> unnecessarily so which makes it unsuitable for merging in the eyes of
> the rsync maintainers (not that I can blame them). This is by no fault
> of its author; using fadvise is just far harder than it should be.

Yeah. I agree with this patch don't have negative impact. I was only
curious why Minchan drop autodetect logic.

Please don't think I'm against your effort.

Thanks.

2010-11-24 00:18:01

[permalink] [raw]

Subject: Re: [RFC PATCH] fadvise support in rsync

> Here is my attempt at adding fadvise support to rsync (against v3.0.7). I do
> this in both the sender (hinting after match_sums()) and the receiver (hinting
> after receive_data()). In principle we could get better granularity if this was
> hooked up within match_sums() (or even the map_ptr() interface) and the receive
> loop in receive_data(), but I wanted to keep things simple at first (any
> comments on these ideas?) . At the moment is for little more than testing.
> Considering the potential negative effects of using FADV_DONTNEED on older
> kernels, it is likely we will want this functionality off by default with a
> command line flag to enable.

Great!

As far as you don't hesitate userland app, we have no reason to hesitate
kernel change. We are only worry about to create no user interface. Because
it's maintainance nightmare.

2010-11-24 10:03:00

[permalink] [raw]

Subject: Re: [RFC 1/2] deactive invalidated pages

On Wed, Nov 24, 2010 at 08:24:35AM +0900, Minchan Kim wrote:
> >> <SNIP>
> >>
> >> +static void __pagevec_lru_deactive(struct pagevec *pvec)
> >> +{
> >
> > Might be worth commenting that this function must be called with pre-emption
> > disabled. FWIW, I am reasonably sure your implementation is prefectly safe
> > but a note wouldn't hurt.
>
> Will fix.
>

Thanks

> >
> >> + ? ? int i, lru, file;
> >> +
> >> + ? ? struct zone *zone = NULL;
> >> +
> >> + ? ? for (i = 0; i < pagevec_count(pvec); i++) {
> >> + ? ? ? ? ? ? struct page *page = pvec->pages[i];
> >> + ? ? ? ? ? ? struct zone *pagezone = page_zone(page);
> >> +
> >> + ? ? ? ? ? ? if (pagezone != zone) {
> >> + ? ? ? ? ? ? ? ? ? ? if (zone)
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
> >> + ? ? ? ? ? ? ? ? ? ? zone = pagezone;
> >> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&zone->lru_lock);
> >> + ? ? ? ? ? ? }
> >> +
> >> + ? ? ? ? ? ? if (PageLRU(page)) {
> >> + ? ? ? ? ? ? ? ? ? ? if (PageActive(page)) {
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? file = page_is_file_cache(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru = page_lru_base_type(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? del_page_from_lru_list(zone, page,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru + LRU_ACTIVE);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageActive(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ClearPageReferenced(page);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? add_page_to_lru_list(zone, page, lru);
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? __count_vm_event(PGDEACTIVATE);
> >> +
> >
> > What about memcg, do we not need to be calling mem_cgroup_add_lru_list() here
> > as well? I'm looking at the differences between what move_active_pages_to_lru()
>
> Recently, add_page_to_lru_list contains mem_cgroup_add_lru_list.
>

My bad, you're right. I was thrown by move_active_pages_to_lru() needing to
do memcg stuff manually and didn't spot why it ok to miss it here.

> > is doing and this. I'm wondering if it'd be worth your whole building a list
> > of active pages that are to be moved to the inactive list and passing them
> > to move_active_pages_to_lru() ? I confuess I have not thought about it deeply
> > so it might be a terrible suggestion but it might reduce duplication of code.
>
> Firstly I tried it so I sent a patch about making
> move_to_active_pages_to_lru more generic.
> move_to_active_pages_to_lru needs zone argument so I need gathering
> pages per zone in truncate.
> I don't want for user of the function to consider even zone and
> zone->lru_lock handling.
>
> I think the lru_demote_pages could be used elsewhere(ex, readahead max
> size heuristic).
> So it's generic and easy to use. :)
>

Ok, that's fair enough.

> >
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? update_page_reclaim_stat(zone, page, file, 0);
> >> + ? ? ? ? ? ? ? ? ? ? }
> >> + ? ? ? ? ? ? }
> >> + ? ? }
> >> + ? ? if (zone)
> >> + ? ? ? ? ? ? spin_unlock_irq(&zone->lru_lock);
> >> +
> >> + ? ? release_pages(pvec->pages, pvec->nr, pvec->cold);
> >> + ? ? pagevec_reinit(pvec);
> >> +}
> >> +
> >> ?/*
> >> ? * Drain pages out of the cpu's pagevecs.
> >> ? * Either "cpu" is the current CPU, and preemption has already been
> >> @@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
> >> ? ? ? ? ? ? ? pagevec_move_tail(pvec);
> >> ? ? ? ? ? ? ? local_irq_restore(flags);
> >> ? ? ? }
> >> +
> >> + ? ? pvec = &per_cpu(lru_deactive_pvecs, cpu);
> >> + ? ? if (pagevec_count(pvec))
> >> + ? ? ? ? ? ? __pagevec_lru_deactive(pvec);
> >> +}
> >> +
> >> +/*
> >> + * Function used to forecefully demote a page to the head of the inactive
> >> + * list.
> >
> > s/forecefully/forcefully/
> >
> > The comment should also state *why* and under what circumstances we move
> > pages to the inactive list like this. Also based on the discussions
> > elsewhere in this thread, it'd be nice to include a comment why it's the
> > head of the inactive list and not the tail.
>
> Fair enough.
>
> Thanks for the comment, Mel.
>
> --
> Kind regards,
> Minchan Kim
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-24 18:01:39