2012-10-30 01:24:12

by Minchan Kim

Subject: [RFC v2] Support volatile range for anon vma

This patch introduces new madvise behaviors, MADV_VOLATILE and
MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
version, which covers only tmpfs, while this patch covers only
anonymous pages, so it cannot replace John's work.
If the idea below proves reasonable, I hope we can unify both
concepts via madvise/fadvise.

The rationale is as follows.
Many allocators call munmap(2) when the user calls free(3) and the
pointer lies in an mmaped area. But munmap isn't cheap because it has
to clean up all PTE entries and unlink the VMA, so the overhead grows
linearly with the mmaped area's size.

Robert Love's volatile concept could be very useful for reducing
free(3) overhead. Allocators can call madvise(MADV_VOLATILE) instead of
munmap(2) (of course, they need to manage the volatile mmaped areas to
limit address-space shortage and sometimes end up unmapping them).
The madvise(MADV_VOLATILE|NOVOLATILE) calls are very cheap operations
because

1) they just mark a flag in the VMA,
2) if memory pressure happens, the VM can discard pages of a volatile
VMA instead of swapping them out when volatile pages are selected as
victims by the normal VM aging policy, and
3) a freed mmaped area doesn't contain any meaningful data, so there
is no point in swapping it out.

The allocator should call madvise(MADV_NOVOLATILE) before reusing that
area for an allocation handed back to the user. Otherwise, accesses to
the volatile range will hit a SIGBUS error.
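
For illustration, here is a minimal userspace sketch of the intended
allocator flow. It assumes the MADV_VOLATILE/MADV_NOVOLATILE values
from the mman-common.h hunk below (they are not in libc headers yet)
and omits the chunk bookkeeping and error handling a real allocator
would need:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE	5	/* values from this patch, not in libc yet */
#define MADV_NOVOLATILE	6
#endif

/* free() path: keep the mapping, but let the VM discard it under pressure. */
static void chunk_release(void *ptr, size_t len)
{
	madvise(ptr, len, MADV_VOLATILE);
}

/*
 * malloc() path: must run before the range is handed back to the user,
 * since touching a still-volatile range raises SIGBUS.  Under this RFC,
 * a positive return from madvise(MADV_NOVOLATILE) means some pages were
 * purged, so the old contents must be treated as lost.
 */
static void *chunk_reuse(void *ptr, size_t len)
{
	int ret = madvise(ptr, len, MADV_NOVOLATILE);

	if (ret > 0)
		fprintf(stderr, "range was purged; contents are gone\n");
	return ptr;
}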

The downside is that we have to age the anon LRU list even when we
don't have swap, because I don't want volatile pages discarded with top
priority when memory pressure happens. Volatile in this patch means "we
don't need to swap it out because the user can handle the data
disappearing suddenly", NOT "it is useless, so hurry up and reclaim it".
So I want to apply the same aging rule as for normal pages.

Anon background aging on a non-swap system would be the trade-off for
getting this feature. We actually did that until [1] was merged two
years ago, and I believe the free(3) performance gain will outweigh the
overhead of anon LRU aging once allocators start to use this madvise.
(This patch doesn't include background aging for the non-swap case, but
it's trivial to add if we decide to.)

I hope to hear opinions from others before diving into glibc or bionic.
Any comments are welcome.

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Changelog
* from RFC v1
* add a clarifying comment about purge - Christoph
* Change patch description

Cc: John Stultz <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/asm-generic/mman-common.h | 3 +
include/linux/mm.h | 9 ++-
include/linux/mm_types.h | 5 ++
include/linux/rmap.h | 24 ++++++-
mm/ksm.c | 4 +-
mm/madvise.c | 32 +++++++++-
mm/memory.c | 2 +
mm/migrate.c | 6 +-
mm/rmap.c | 126 +++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 3 +
10 files changed, 202 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index d030d2c..5f8090d 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -34,6 +34,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_VOLATILE 5 /* pages will disappear suddenly */
+#define MADV_NOVOLATILE 6 /* pages will not disappear */
+

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..78c8c08 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */

+/*
+ * Recently, Konstantin removed a few flags but not merged yet
+ * so we will get a room for new flag for supporting 32 bit.
+ * Thanks, Konstantin!
+ */
+#define VM_VOLATILE 0x100000000
+
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

@@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
* Special vmas that are non-mergable, non-mlock()able.
* Note: mm/huge_memory.c VM_NO_THP depends on this definition.
*/
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+#define VM_SPECIAL (VM_IO|VM_DONTEXPAND|VM_RESERVED|VM_PFNMAP|VM_VOLATILE)

/*
* mapping from the currently active vm_flags protection bits (the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf78672..c813daa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -279,6 +279,11 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ /*
+ * True if more than a page in this vma is reclaimed.
+ * It's protected by anon_vma->mutex.
+ */
+ bool purged;
};

struct core_thread {
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3fce545..65b9f33 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,10 @@ struct anon_vma_chain {
struct list_head same_anon_vma; /* locked by anon_vma->mutex */
};

+
+void volatile_lock(struct vm_area_struct *vma);
+void volatile_unlock(struct vm_area_struct *vma);
+
#ifdef CONFIG_MMU
static inline void get_anon_vma(struct anon_vma *anon_vma)
{
@@ -170,12 +174,14 @@ enum ttu_flags {
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+ TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

int try_to_unmap(struct page *, enum ttu_flags flags);
int try_to_unmap_one(struct page *, struct vm_area_struct *,
- unsigned long address, enum ttu_flags flags);
+ unsigned long address, enum ttu_flags flags,
+ bool *is_volatile);

/*
* Called from mm/filemap_xip.c to unmap empty zero page
@@ -194,6 +200,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
return ptep;
}

+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
+ unsigned long, spinlock_t **);
+
+static inline pte_t *page_check_volatile_address(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ spinlock_t **ptlp)
+{
+ pte_t *ptep;
+
+ __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
+ mm, address, ptlp));
+ return ptep;
+}
+
/*
* Used by swapoff to help locate where page is expected in vma.
*/
@@ -257,5 +278,6 @@ static inline int page_mkclean(struct page *page)
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
#define SWAP_MLOCK 3
+#define SWAP_DISCARD 4

#endif /* _LINUX_RMAP_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..22c54d2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1653,6 +1653,7 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
struct rmap_item *rmap_item;
int ret = SWAP_AGAIN;
int search_new_forks = 0;
+ bool dummy_volatile;

VM_BUG_ON(!PageKsm(page));
VM_BUG_ON(!PageLocked(page));
@@ -1682,7 +1683,8 @@ again:
continue;

ret = try_to_unmap_one(page, vma,
- rmap_item->address, flags);
+ rmap_item->address, flags,
+ &dummy_volatile);
if (ret != SWAP_AGAIN || !page_mapped(page)) {
anon_vma_unlock(anon_vma);
goto out;
diff --git a/mm/madvise.c b/mm/madvise.c
index 14d260f..53cd77f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
if (error)
goto out;
break;
+ case MADV_VOLATILE:
+ if (vma->vm_flags & VM_LOCKED) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags |= VM_VOLATILE;
+ vma->purged = false;
+ break;
+ case MADV_NOVOLATILE:
+ if (!(vma->vm_flags & VM_VOLATILE)) {
+ error = -EINVAL;
+ goto out;
+ }
+
+ new_flags &= ~VM_VOLATILE;
+ break;
}

if (new_flags == vma->vm_flags) {
@@ -118,9 +134,15 @@ static long madvise_behavior(struct vm_area_struct * vma,
success:
/*
* vm_flags is protected by the mmap_sem held in write mode.
+ * In case of VOLATILE, we need volatile_lock, additionally.
*/
+ if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
+ volatile_lock(vma);
vma->vm_flags = new_flags;
-
+ if (behavior == MADV_NOVOLATILE)
+ error = vma->purged;
+ if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
+ volatile_unlock(vma);
out:
if (error == -ENOMEM)
error = -EAGAIN;
@@ -310,6 +332,8 @@ madvise_behavior_valid(int behavior)
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
+ case MADV_VOLATILE:
+ case MADV_NOVOLATILE:
return 1;

default:
@@ -383,7 +407,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)

if (start & ~PAGE_MASK)
goto out;
- len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+ if (behavior != MADV_VOLATILE && behavior != MADV_NOVOLATILE)
+ len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+ else
+ len = len_in & PAGE_MASK;

/* Check to see whether len was rounded up from small -ve to zero */
if (len_in && !len)
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..26b3f73 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3441,6 +3441,8 @@ int handle_pte_fault(struct mm_struct *mm,
entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
+ if (unlikely(vma->vm_flags & VM_VOLATILE))
+ return VM_FAULT_SIGBUS;
if (vma->vm_ops) {
if (likely(vma->vm_ops->fault))
return do_linear_fault(mm, vma, address,
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..d1b51af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
}

/* Establish migration ptes or remove ptes */
- try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+ TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);

skip_unmap:
if (!page_mapped(page))
@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (PageAnon(hpage))
anon_vma = page_get_anon_vma(hpage);

- try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+ TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);

if (!page_mapped(hpage))
rc = move_to_new_page(new_hpage, hpage, 1, mode);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..778abfc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
return vma_address(page, vma);
}

+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ swp_entry_t entry = { .val = page_private(page) };
+
+ if (unlikely(PageHuge(page))) {
+ pte = huge_pte_offset(mm, address);
+ ptl = &mm->page_table_lock;
+ goto check;
+ }
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ return NULL;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return NULL;
+ if (pmd_trans_huge(*pmd))
+ return NULL;
+
+ pte = pte_offset_map(pmd, address);
+ ptl = pte_lockptr(mm, pmd);
+check:
+ spin_lock(ptl);
+ if (PageAnon(page)) {
+ if (!pte_present(*pte) && entry.val ==
+ pte_to_swp_entry(*pte).val) {
+ *ptlp = ptl;
+ return pte;
+ }
+ } else {
+ if (pte_none(*pte)) {
+ *ptlp = ptl;
+ return pte;
+ }
+ }
+ pte_unmap_unlock(pte, ptl);
+ return NULL;
+}
+
/*
* Check that @page is mapped at @address into @mm.
*
@@ -1218,12 +1269,42 @@ out:
mem_cgroup_end_update_page_stat(page, &locked, &flags);
}

+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ pte_t pteval;
+ spinlock_t *ptl;
+
+ pte = page_check_volatile_address(page, mm, address, &ptl);
+ if (!pte)
+ return 0;
+
+ /* Nuke the page table entry. */
+ flush_cache_page(vma, address, page_to_pfn(page));
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ if (PageAnon(page)) {
+ swp_entry_t entry = { .val = page_private(page) };
+ if (PageSwapCache(page)) {
+ dec_mm_counter(mm, MM_SWAPENTS);
+ swap_free(entry);
+ }
+ }
+
+ pte_unmap_unlock(pte, ptl);
+ mmu_notifier_invalidate_page(mm, address);
+ return 1;
+}
+
/*
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
*/
int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- unsigned long address, enum ttu_flags flags)
+ unsigned long address, enum ttu_flags flags,
+ bool *is_volatile)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
@@ -1235,6 +1316,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (!pte)
goto out;

+ if (!(vma->vm_flags & VM_VOLATILE))
+ *is_volatile = false;
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
@@ -1494,6 +1577,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
struct anon_vma *anon_vma;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
+ bool is_volatile = true;
+
+ if (flags & TTU_IGNORE_VOLATILE)
+ is_volatile = false;

anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -1512,17 +1599,32 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
* temporary VMAs until after exec() completes.
*/
if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
- is_vma_temporary_stack(vma))
+ is_vma_temporary_stack(vma)) {
+ is_volatile = false;
continue;
+ }

address = vma_address(page, vma);
if (address == -EFAULT)
continue;
- ret = try_to_unmap_one(page, vma, address, flags);
+ ret = try_to_unmap_one(page, vma, address, flags, &is_volatile);
if (ret != SWAP_AGAIN || !page_mapped(page))
break;
}

+ if (page_mapped(page) || is_volatile == false)
+ goto out;
+
+ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+ struct vm_area_struct *vma = avc->vma;
+ unsigned long address;
+
+ address = vma_address(page, vma);
+ if (try_to_zap_one(page, vma, address))
+ vma->purged = true;
+ }
+ ret = SWAP_DISCARD;
+out:
page_unlock_anon_vma(anon_vma);
return ret;
}
@@ -1553,13 +1655,14 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
+ bool dummy;

mutex_lock(&mapping->i_mmap_mutex);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
continue;
- ret = try_to_unmap_one(page, vma, address, flags);
+ ret = try_to_unmap_one(page, vma, address, flags, &dummy);
if (ret != SWAP_AGAIN || !page_mapped(page))
goto out;
}
@@ -1651,6 +1754,7 @@ out:
* SWAP_AGAIN - we missed a mapping, try again later
* SWAP_FAIL - the page is unswappable
* SWAP_MLOCK - page is mlocked.
+ * SWAP_DISCARD - page is volatile.
*/
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
@@ -1665,7 +1769,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
ret = try_to_unmap_anon(page, flags);
else
ret = try_to_unmap_file(page, flags);
- if (ret != SWAP_MLOCK && !page_mapped(page))
+ if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
ret = SWAP_SUCCESS;
return ret;
}
@@ -1707,6 +1811,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
anon_vma_free(anon_vma);
}

+void volatile_lock(struct vm_area_struct *vma)
+{
+ if (vma->anon_vma)
+ anon_vma_lock(vma->anon_vma);
+}
+
+void volatile_unlock(struct vm_area_struct *vma)
+{
+ if (vma->anon_vma)
+ anon_vma_unlock(vma->anon_vma);
+}
+
#ifdef CONFIG_MIGRATION
/*
* rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..d5b60d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -789,6 +789,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, TTU_UNMAP)) {
+ case SWAP_DISCARD:
+ goto discard_page;
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -857,6 +859,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

+discard_page:
/*
* If the page has buffers, try to free the buffer mappings
* associated with this page. If we succeed we try to free
--
1.7.9.5


2012-10-31 21:35:27

by Andrew Morton

Subject: Re: [RFC v2] Support volatile range for anon vma

On Tue, 30 Oct 2012 10:29:54 +0900
Minchan Kim <[email protected]> wrote:

> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
>
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.

Presumably the userspace allocator will internally manage memory in
large chunks, so the munmap() call frequency will be much lower than
the free() call frequency. So the performance gains from this change
might be very small.

The whole point of the patch is to improve performance, but we have no
evidence that it was successful in doing that! I do think we'll need
good quantitative testing results before proceeding with such a patch,
please.

Also, it is very desirable that we involve the relevant userspace
(glibc, etc) developers in this. And I understand that the google
tcmalloc project will probably have interest in this - I've cc'ed
various people@google in the hope that they can provide input (please).

Also, it is a userspace API change. Please cc [email protected].

Also, I assume that you have userspace test code. At some stage,
please consider adding a case to tools/testing/selftests. Such a test
would require the creation of memory pressure, which is rather contrary
to the selftests' current philosophy of being a bunch of short-running
little tests. Perhaps you can come up with something. But I suggest
that such work be done later, once it becomes clearer that this code is
actually headed into the kernel.

> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.

Well, why? It would be easy enough for the fault handler to give
userspace a new, zeroed page at that address.

Or we could simply leave the old page in place at that address. If the
page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
page (or all the not-yet-reclaimed pages) back to userspace at their
old addresses.

Various options suggest themselves here. You've chosen one of them but
I would like to see a pretty exhaustive description of the reasoning
behind that decision.

Also, I wonder about the interaction with other vma manipulation
operations. For example, can a VMA get split when in the MADV_VOLATILE
state? If so, what happens?

Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
but I bet this wasn't tested ;)

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> if (error)
> goto out;
> break;
> + case MADV_VOLATILE:
> + if (vma->vm_flags & VM_LOCKED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_VOLATILE;
> + vma->purged = false;
> + break;
> + case MADV_NOVOLATILE:
> + if (!(vma->vm_flags & VM_VOLATILE)) {
> + error = -EINVAL;
> + goto out;

I wonder if this really should return an error. Other madvise()
options don't do this, and running MADV_NOVOLATILE against a
not-volatile area seems pretty benign and has clearly defined before-
and after- states.

> + }
> +
> + new_flags &= ~VM_VOLATILE;
> + break;
> }
>
> if (new_flags == vma->vm_flags) {

2012-10-31 21:59:40

by Paul Turner

Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
<[email protected]> wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <[email protected]> wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency. So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux. For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also, from a performance standpoint, I doubt any allocator (which
cares about performance) is going to want to pay the cost of even a
null syscall around typical malloc/free usage (consider: a tcmalloc
malloc/free pair is currently <20ns). Given that this cost is then
amortized once you start doing discards on larger blocks, MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate,
since there's no munmap/mmap churn for the region
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space. Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that! I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this. And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).
>
> Also, it is a userspace API change. Please cc [email protected].
>
> Also, I assume that you have userspace test code. At some stage,
> please consider adding a case to tools/testing/selftests. Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests. Perhaps you can come up with something. But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why? It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address. If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here. You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations. For example, can a VMA get split when in the MADV_VOLATILE
> state? If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > if (error)
> > goto out;
> > break;
> > + case MADV_VOLATILE:
> > + if (vma->vm_flags & VM_LOCKED) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags |= VM_VOLATILE;
> > + vma->purged = false;
> > + break;
> > + case MADV_NOVOLATILE:
> > + if (!(vma->vm_flags & VM_VOLATILE)) {
> > + error = -EINVAL;
> > + goto out;
>
> I wonder if this really should return an error. Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > + }
> > +
> > + new_flags &= ~VM_VOLATILE;
> > + break;
> > }
> >
> > if (new_flags == vma->vm_flags) {
>

2012-10-31 22:56:28

by KOSAKI Motohiro

Subject: Re: [RFC v2] Support volatile range for anon vma

>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
>> > allocating that area to user. Otherwise, accessing of volatile range
>> > will meet SIGBUS error.
>>
>> Well, why? It would be easy enough for the fault handler to give
>> userspace a new, zeroed page at that address.
>
> Note: MADV_DONTNEED already has this (nice) property.

I don't think I strictly understand this patch, but maybe I can answer
why userland and malloc folks don't like MADV_DONTNEED.

glibc malloc discards freed memory using MADV_DONTNEED, as tcmalloc
does, and it is often a source of large performance decreases, because
MADV_DONTNEED discards memory immediately and the very next malloc()
falls into the page-fault and page-sized memset() path. So using
DONTNEED increases the zero-fill and cache-miss rates.

At free() time, malloc has no knowledge of when the next big malloc()
will be called, so immediate discarding may or may not be a performance
win. (Ah, OK, the ratio isn't 50:50, so it is usually worth it, but not
every time.)
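
For reference, a small self-contained program demonstrating the cost
pattern described above (the mapping size is arbitrary): after
MADV_DONTNEED the mapping stays valid, but the next touch takes a page
fault and gets a freshly zero-filled page.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;	/* 1 MiB, arbitrary */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, len);		/* fault in and dirty the pages */
	madvise(p, len, MADV_DONTNEED);	/* discard them immediately */

	/* The next access faults again and sees zero-filled memory. */
	printf("first byte after DONTNEED: %d\n", p[0]);	/* prints 0 */

	munmap(p, len);
	return 0;
}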


In the past, several developers tried to avoid this situation, for
example by:

- making a zero-page daemon to avoid the page-sized zero fill at page fault
- making a new VMA or page flag that marks memory as discardable without
swap, with vmscan treating it specially (like this patch and/or MADV_FREE)
- making a new process option to skip page zero fill in the page fault path
(yes, it is a big incompatibility and insecure, but some embedded folks
thought that was an acceptable downside)
- etc.


BTW, I'm not sure this patch is better for malloc, because the current
MADV_DONTNEED doesn't need mmap_sem and works very effectively in the
many-threads case. Taking mmap_sem might bring worse performance than
DONTNEED. Dunno.

2012-11-01 00:15:23

by Minchan Kim

Subject: Re: [RFC v2] Support volatile range for anon vma

Hi Andrew,

On Wed, Oct 31, 2012 at 02:35:24PM -0700, Andrew Morton wrote:
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <[email protected]> wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency. So the performance gains from this change
> might be very small.
>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that! I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.

Absolutely. That's why I sent it as an RFC.
At this point, I would like to reach a consensus that this idea makes
sense before investigating further, because we have a large pool of
experienced developers and one of them might know whether this is
really needed or not.

>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this. And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).

Thanks! I should have done that. Such input is exactly what I need now.

>
> Also, it is a userspace API change. Please cc [email protected].

This is an RFC, so nothing is fixed yet.
I will Cc them once everything I need to solve is worked out and the
interface is settled.

>
> Also, I assume that you have userspace test code. At some stage,
> please consider adding a case to tools/testing/selftests. Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests. Perhaps you can come up with something. But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.

Yes.

>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why? It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Absolutely. That would be convenient, but in fact I am considering
unifying this with John Stultz's fallocate volatile range, which covers
only tmpfs pages, so madvise/fadvise might be the better candidate for
the API. In the tmpfs case, John implemented it to return a zero page
when someone accesses the volatile region, as you mentioned, but at
this kernel summit Hugh argued for returning SIGBUS, and I think that
makes debugging easier.

Another option is to put a flag in the API indicating whether the VM
should return a zero page or SIGBUS when the user accesses the volatile
range, so users can choose what they want.

>
> Or we could simply leave the old page in place at that address. If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here. You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.

Will do.

>
> Also, I wonder about the interaction with other vma manipulation
> operations. For example, can a VMA get split when in the MADV_VOLATILE
> state? If so, what happens?

Both VMAs would be volatile, although one of them may never have had
pages reclaimed. I understand it's not optimal, but I expect users will
not perform such operations (e.g. mprotect, mremap) frequently on a
volatile VMA.

If they do, maybe I need to come up with something, but it wouldn't be easy.

>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)

Yes, I didn't consider that yet. AFAIK, nonlinear VMAs are related to
file-mapped VMAs, while this patch considers only anon VMAs, which is
fine for a first step.

>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > if (error)
> > goto out;
> > break;
> > + case MADV_VOLATILE:
> > + if (vma->vm_flags & VM_LOCKED) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags |= VM_VOLATILE;
> > + vma->purged = false;
> > + break;
> > + case MADV_NOVOLATILE:
> > + if (!(vma->vm_flags & VM_VOLATILE)) {
> > + error = -EINVAL;
> > + goto out;
>
> I wonder if this really should return an error. Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.

Will fix.

Thanks for the review, Andrew.

>
> > + }
> > +
> > + new_flags &= ~VM_VOLATILE;
> > + break;
> > }
> >
> > if (new_flags == vma->vm_flags) {
>

--
Kind regards,
Minchan Kim

2012-11-01 00:44:56

by Minchan Kim

Subject: Re: [RFC v2] Support volatile range for anon vma

Hello,

On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> <[email protected]> wrote:
> >
> > On Tue, 30 Oct 2012 10:29:54 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> > > This patch introudces new madvise behavior MADV_VOLATILE and
> > > MADV_NOVOLATILE for anonymous pages. It's different with
> > > John Stultz's version which considers only tmpfs while this patch
> > > considers only anonymous pages so this cannot cover John's one.
> > > If below idea is proved as reasonable, I hope we can unify both
> > > concepts by madvise/fadvise.
> > >
> > > Rationale is following as.
> > > Many allocators call munmap(2) when user call free(3) if ptr is
> > > in mmaped area. But munmap isn't cheap because it have to clean up
> > > all pte entries and unlinking a vma so overhead would be increased
> > > linearly by mmaped area's size.
> >
> > Presumably the userspace allocator will internally manage memory in
> > large chunks, so the munmap() call frequency will be much lower than
> > the free() call frequency. So the performance gains from this change
> > might be very small.
>
> I don't think I strictly understand the motivation from a
> malloc-standpoint here.
>
> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> to perform discards on Linux. For any reasonable allocator (short
> of binding malloc --> mmap, free --> unmap) this seems a better
> choice.
>
> Note also from a performance stand-point I doubt any allocator (which
> case about performance) is going to want to pay the cost of even a
> null syscall about typical malloc/free usage (consider: a tcmalloc

Good point.

> malloc/free pairis currently <20ns). Given then that this cost is
> amortized once you start doing discards on larger blocks MADV_DONTNEED
> seems a preferable interface:
> - You don't need to reconstruct an arena when you do want to allocate
> since there's no munmap/mmap for the region to change about
> - There are no syscalls involved in later reallocating the block.

The above benefits apply to MADV_VOLATILE, too.
But as you pointed out, there is a bit more overhead than with DONTNEED
because the allocator should call madvise(MADV_NOVOLATILE) before
allocation. Although madvise(NOVOLATILE) just marks a VMA flag, it does
need mmap_sem and could be a problem for parallel malloc/free
workloads, as KOSAKI pointed out.

In that case, we could change the semantics so malloc doesn't need to
call madvise(NOVOLATILE) before allocating. Then the page fault handler
would have to check whether the fault was caused by an access to a
volatile VMA. If so, it could return a zero page instead of SIGBUS and
mark the VMA as no longer volatile.
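
A rough sketch of that alternative, starting from the handle_pte_fault()
hunk in this patch (locking and the exact flag-clearing protocol are
hand-waved here, so this is illustrative only, not a tested change):

	if (pte_none(entry)) {
		if (unlikely(vma->vm_flags & VM_VOLATILE)) {
			/*
			 * Implicit madvise(MADV_NOVOLATILE) on first
			 * touch: drop the volatile state and fall
			 * through, so the fault is served with a
			 * zero-filled page instead of SIGBUS.
			 */
			vma->vm_flags &= ~VM_VOLATILE;
		}
		if (vma->vm_ops) {
			if (likely(vma->vm_ops->fault))
				return do_linear_fault(mm, vma, address,
						pte, pmd, flags, entry);
		}
		return do_anonymous_page(mm, vma, address,
						pte, pmd, flags);
	}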

>
> The only real additional cost is address-space. Are you strongly
> concerned about the 32-bit case?

No. I believe allocators have logic to clean them up once the address
space is almost full.

Thanks, Paul.

--
Kind regards,
Minchan Kim

2012-11-01 01:16:06

by Paul Turner

Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
<[email protected]> wrote:
>>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
>>> > allocating that area to user. Otherwise, accessing of volatile range
>>> > will meet SIGBUS error.
>>>
>>> Well, why? It would be easy enough for the fault handler to give
>>> userspace a new, zeroed page at that address.
>>
>> Note: MADV_DONTNEED already has this (nice) property.
>
> I don't think I strictly understand this patch. but maybe I can answer why
> userland and malloc folks don't like MADV_DONTNEED.
>
> glibc malloc discard freed memory by using MADV_DONTNEED
> as tcmalloc. and it is often a source of large performance decrease.
> because of MADV_DONTNEED discard memory immediately and
> right after malloc() call fall into page fault and pagesize memset() path.
> then, using DONTNEED increased zero fill and cache miss rate.
>
> At called free() time, malloc don't have a knowledge when next big malloc()
> is called. then, immediate discarding may or may not get good performance
> gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not everytime)
>

Ah; in tcmalloc, allocations (and their associated free-lists) are
binned into separate lists as a function of object size, which helps to
mitigate this.

I'd make a separate, more general argument here:
if I'm allocating a large (multi-kilobyte) object, the cost of what I'm
about to do with that object is likely fairly large -- the fault/zero
cost is probably a fairly small proportional cost, which limits the
optimization's value.

>
> In past, several developers tryied to avoid such situation, likes
>
> - making zero page daemon and avoid pagesize zero fill at page fault
> - making new vma or page flags and mark as discardable w/o swap and
> vmscan treat it. (like this and/or MADV_FREE)
> - making new process option and avoid page zero fill from page fault path.
> (yes, it is big incompatibility and insecure. but some embedded folks thought
> they are acceptable downside)
> - etc
>
>
> btw, I'm not sure this patch is better for malloc because current MADV_DONTNEED
> don't need mmap_sem and works very effectively when a lot of threads case.
> taking mmap_sem might bring worse performance than DONTNEED. dunno.

MADV_VOLATILE also seems to end up looking quite similar to a
user-visible (range-based) cleancache.

A second popular use-case for such semantics is the case of
discardable cache elements (e.g. web browser). I suspect we'd want to
at least mention these in the changelog. (Alternatively, what does a
cleancache-backed-fs exposing these semantics look like?)

2012-11-01 01:19:50

by Minchan Kim

Subject: Re: [RFC v2] Support volatile range for anon vma

Hi KOSAKI,

On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> >> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> >> > allocating that area to user. Otherwise, accessing of volatile range
> >> > will meet SIGBUS error.
> >>
> >> Well, why? It would be easy enough for the fault handler to give
> >> userspace a new, zeroed page at that address.
> >
> > Note: MADV_DONTNEED already has this (nice) property.
>
> I don't think I strictly understand this patch. but maybe I can answer why
> userland and malloc folks don't like MADV_DONTNEED.
>
> glibc malloc discard freed memory by using MADV_DONTNEED
> as tcmalloc. and it is often a source of large performance decrease.
> because of MADV_DONTNEED discard memory immediately and
> right after malloc() call fall into page fault and pagesize memset() path.
> then, using DONTNEED increased zero fill and cache miss rate.
>
> At called free() time, malloc don't have a knowledge when next big malloc()
> is called. then, immediate discarding may or may not get good performance
> gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not everytime)
>
>
> In past, several developers tryied to avoid such situation, likes
>
> - making zero page daemon and avoid pagesize zero fill at page fault
> - making new vma or page flags and mark as discardable w/o swap and
> vmscan treat it. (like this and/or MADV_FREE)

Thanks for the information.
You made me realize I'm not the first person to think of this idea.
Rik already tried it (https://lkml.org/lkml/2007/4/17/53) with a new
page flag, and other OSes already have such a feature. And John's
concept was already tried a long time ago
(https://lkml.org/lkml/2005/11/1/384).

Hmm, I looked over Rik's thread but couldn't find why it wasn't merged
at that time. Does anyone know?

> - making new process option and avoid page zero fill from page fault path.
> (yes, it is big incompatibility and insecure. but some embedded folks thought
> they are acceptable downside)
> - etc
>
>
> btw, I'm not sure this patch is better for malloc because current MADV_DONTNEED
> don't need mmap_sem and works very effectively when a lot of threads case.
> taking mmap_sem might bring worse performance than DONTNEED. dunno.

It's a good point.

Quote from my reply to Paul
"
In such case, we can change semantic so malloc doesn't need to call
madivse(NOVOLATILE) before allocating. Then, page fault handler have to
check whether this page fault happen by access of volatile vma. If so,
it could return zero page instead of SIGBUS and mark the vma isn't volatile
any more.
"
>

--
Kind regards,
Minchan Kim

2012-11-01 01:23:31

by Paul Turner

Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim <[email protected]> wrote:
> Hello,
>
> On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
>> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
>> <[email protected]> wrote:
>> >
>> > On Tue, 30 Oct 2012 10:29:54 +0900
>> > Minchan Kim <[email protected]> wrote:
>> >
>> > > This patch introudces new madvise behavior MADV_VOLATILE and
>> > > MADV_NOVOLATILE for anonymous pages. It's different with
>> > > John Stultz's version which considers only tmpfs while this patch
>> > > considers only anonymous pages so this cannot cover John's one.
>> > > If below idea is proved as reasonable, I hope we can unify both
>> > > concepts by madvise/fadvise.
>> > >
>> > > Rationale is following as.
>> > > Many allocators call munmap(2) when user call free(3) if ptr is
>> > > in mmaped area. But munmap isn't cheap because it have to clean up
>> > > all pte entries and unlinking a vma so overhead would be increased
>> > > linearly by mmaped area's size.
>> >
>> > Presumably the userspace allocator will internally manage memory in
>> > large chunks, so the munmap() call frequency will be much lower than
>> > the free() call frequency. So the performance gains from this change
>> > might be very small.
>>
>> I don't think I strictly understand the motivation from a
>> malloc-standpoint here.
>>
>> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
>> to perform discards on Linux. For any reasonable allocator (short
>> of binding malloc --> mmap, free --> unmap) this seems a better
>> choice.
>>
>> Note also from a performance stand-point I doubt any allocator (which
>> case about performance) is going to want to pay the cost of even a
>> null syscall about typical malloc/free usage (consider: a tcmalloc
>
> Good point.
>
>> malloc/free pairis currently <20ns). Given then that this cost is
>> amortized once you start doing discards on larger blocks MADV_DONTNEED
>> seems a preferable interface:
>> - You don't need to reconstruct an arena when you do want to allocate
>> since there's no munmap/mmap for the region to change about
>> - There are no syscalls involved in later reallocating the block.
>
> Above benefits are applied on MADV_VOLATILE, too.
> But as you pointed out, there is a little bit overhead than DONTNEED
> because allocator should call madvise(MADV_NOVOLATILE) before allocation.
> For mavise(NOVOLATILE) does just mark vma flag, it does need mmap_sem
> and could be a problem on parallel malloc/free workload as KOSAKI pointed out.
>
> In such case, we can change semantic so malloc doesn't need to call
> madivse(NOVOLATILE) before allocating. Then, page fault handler have to
> check whether this page fault happen by access of volatile vma. If so,
> it could return zero page instead of SIGBUS and mark the vma isn't volatile
> any more.

I think being able to determine whether the backing was discarded
(around an atomic transition to non-volatile) would be a required
property to make this useful for non-malloc use cases.

>
>>
>> The only real additional cost is address-space. Are you strongly
>> concerned about the 32-bit case?
>
> No. I believe allocators have a logic to clean up them once address space is
> almost full.
>
> Thanks, Paul.
>
> --
> Kind regards,
> Minchan Kim

2012-11-01 01:27:28

by Minchan Kim

Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 06:22:58PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim <[email protected]> wrote:
> > Hello,
> >
> > On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> >> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> >> <[email protected]> wrote:
> >> >
> >> > On Tue, 30 Oct 2012 10:29:54 +0900
> >> > Minchan Kim <[email protected]> wrote:
> >> >
> >> > > This patch introudces new madvise behavior MADV_VOLATILE and
> >> > > MADV_NOVOLATILE for anonymous pages. It's different with
> >> > > John Stultz's version which considers only tmpfs while this patch
> >> > > considers only anonymous pages so this cannot cover John's one.
> >> > > If below idea is proved as reasonable, I hope we can unify both
> >> > > concepts by madvise/fadvise.
> >> > >
> >> > > Rationale is following as.
> >> > > Many allocators call munmap(2) when user call free(3) if ptr is
> >> > > in mmaped area. But munmap isn't cheap because it have to clean up
> >> > > all pte entries and unlinking a vma so overhead would be increased
> >> > > linearly by mmaped area's size.
> >> >
> >> > Presumably the userspace allocator will internally manage memory in
> >> > large chunks, so the munmap() call frequency will be much lower than
> >> > the free() call frequency. So the performance gains from this change
> >> > might be very small.
> >>
> >> I don't think I strictly understand the motivation from a
> >> malloc-standpoint here.
> >>
> >> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> >> to perform discards on Linux. For any reasonable allocator (short
> >> of binding malloc --> mmap, free --> unmap) this seems a better
> >> choice.
> >>
> >> Note also from a performance stand-point I doubt any allocator (which
> >> case about performance) is going to want to pay the cost of even a
> >> null syscall about typical malloc/free usage (consider: a tcmalloc
> >
> > Good point.
> >
> >> malloc/free pairis currently <20ns). Given then that this cost is
> >> amortized once you start doing discards on larger blocks MADV_DONTNEED
> >> seems a preferable interface:
> >> - You don't need to reconstruct an arena when you do want to allocate
> >> since there's no munmap/mmap for the region to change about
> >> - There are no syscalls involved in later reallocating the block.
> >
> > Above benefits are applied on MADV_VOLATILE, too.
> > But as you pointed out, there is a little bit overhead than DONTNEED
> > because allocator should call madvise(MADV_NOVOLATILE) before allocation.
> > For mavise(NOVOLATILE) does just mark vma flag, it does need mmap_sem
> > and could be a problem on parallel malloc/free workload as KOSAKI pointed out.
> >
> > In such case, we can change semantic so malloc doesn't need to call
> > madivse(NOVOLATILE) before allocating. Then, page fault handler have to
> > check whether this page fault happen by access of volatile vma. If so,
> > it could return zero page instead of SIGBUS and mark the vma isn't volatile
> > any more.
>
> I think being able to determine whether the backing was discarded
> (about a atomic transition to non-volatile) would be a required
> property to make this useful for non-malloc use-cases.
>

Absolutely.

--
Kind regards,
Minchan Kim

2012-11-01 01:40:08

by Minchan Kim

Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 06:15:33PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> >>> > allocating that area to user. Otherwise, accessing of volatile range
> >>> > will meet SIGBUS error.
> >>>
> >>> Well, why? It would be easy enough for the fault handler to give
> >>> userspace a new, zeroed page at that address.
> >>
> >> Note: MADV_DONTNEED already has this (nice) property.
> >
> > I don't think I strictly understand this patch. but maybe I can answer why
> > userland and malloc folks don't like MADV_DONTNEED.
> >
> > glibc malloc discard freed memory by using MADV_DONTNEED
> > as tcmalloc. and it is often a source of large performance decrease.
> > because of MADV_DONTNEED discard memory immediately and
> > right after malloc() call fall into page fault and pagesize memset() path.
> > then, using DONTNEED increased zero fill and cache miss rate.
> >
> > At called free() time, malloc don't have a knowledge when next big malloc()
> > is called. then, immediate discarding may or may not get good performance
> > gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not everytime)
> >
>
> Ah; In tcmalloc allocations (and their associated free-lists) are
> binned into separate lists as a function of object-size which helps to
> mitigate this.
>
> I'd make a separate more general argument here:
> If I'm allocating a large (multi-kilobyte object) the cost of what I'm
> about to do with that object is likely fairly large -- The fault/zero
> cost a probably fairly small proportional cost, which limits the
> optimization value.

While looking at Rik's earlier attempt, which has the same goal but a
different implementation, I found this number:

https://lkml.org/lkml/2007/4/20/390

I believe the optimization is valuable. Of course, I need similar
testing to prove it.

>
> >
> > In past, several developers tryied to avoid such situation, likes
> >
> > - making zero page daemon and avoid pagesize zero fill at page fault
> > - making new vma or page flags and mark as discardable w/o swap and
> > vmscan treat it. (like this and/or MADV_FREE)
> > - making new process option and avoid page zero fill from page fault path.
> > (yes, it is big incompatibility and insecure. but some embedded folks thought
> > they are acceptable downside)
> > - etc
> >
> >
> > btw, I'm not sure this patch is better for malloc because current MADV_DONTNEED
> > don't need mmap_sem and works very effectively when a lot of threads case.
> > taking mmap_sem might bring worse performance than DONTNEED. dunno.
>
> MADV_VOLATILE also seems to end up looking quite similar to a
> user-visible (range-based) cleancache.
>
> A second popular use-case for such semantics is the case of
> discardable cache elements (e.g. web browser). I suspect we'd want to
> at least mention these in the changelog. (Alternatively, what does a
> cleancache-backed-fs exposing these semantics look like?)
>

It's John Stultz's work (http://lwn.net/Articles/518130/; there was
another attempt a long time ago, https://lkml.org/lkml/2005/11/1/384),
and I want to expand the concept from file-backed pages to anonymous
pages, so this patch is a trial for anonymous pages. So the use case of
my patch has focused on the malloc/free case.
I hope both can be unified.


--
Kind regards,
Minchan Kim

2012-11-01 02:01:37

by KOSAKI Motohiro

Subject: Re: [RFC v2] Support volatile range for anon vma

>> - making zero page daemon and avoid pagesize zero fill at page fault
>> - making new vma or page flags and mark as discardable w/o swap and
>> vmscan treat it. (like this and/or MADV_FREE)
>
> Thanks for the information.
> I realized by you I'm not first people to think of this idea.
> Rik already tried it(https://lkml.org/lkml/2007/4/17/53) by new page flag
> and even other OSes already have such good feature. And John's concept was
> already tried long time ago (https://lkml.org/lkml/2005/11/1/384)
>
> Hmm, I look over Rik's thread but couldn't find why it wasn't merged
> at that time. Anyone know it?

Dunno. And I like the volatile feature more than the old one. But bold
remark aside, please don't trust me 100%; I haven't reviewed your
patch's code in detail and I don't strictly understand it.

2012-11-02 01:43:08

by Bob Liu

Subject: Re: [RFC v2] Support volatile range for anon vma

On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim <[email protected]> wrote:
> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
>
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.
>

I have a question.
PTE entries are cleaned up during munmap(), so if user space tries to
access the unmapped address, a page fault will be generated.

What is the result if MADV_VOLATILE is used instead? The PTE entries
are not cleaned up, so can user space still access the memory before
the VM discards the pages?

> Volatile conecept of Robert Love could be very useful for reducing
> free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> munmap(2)(Of course, they need to manage volatile mmaped area to
> reduce shortage of address space and sometime ends up unmaping them).
> The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because
>
> 1) it just marks the flag in VMA and
> 2) if memory pressure happens, VM can discard pages of volatile VMA
> instead of swapping out when volatile pages is selected as victim
> by normal VM aging policy.
> 3) freed mmaped area doesn't include any meaningful data so there
> is no point to swap them out.
>
> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.
>
> The downside is that we have to age anon lru list although we don't
> have swap because I don't want to discard volatile pages by top priority
> when memory pressure happens as volatile in this patch means "We don't
> need to swap out because user can handle the situation which data are
> disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> So I want to apply same aging rule of nomal pages to them.
>
> Anon background aging of non-swap system would be a trade-off for
> getting good feature. Even, we had done it two years ago until merge
> [1] and I believe free(3) performance gain will beat loss of anon lru
> aging's overead once all of allocator start to use madvise.
> (This patch doesn't include background aging in case of non-swap system
> but it's trivial if we decide)
>
> I hope seeing opinions from others before diving into glibc or bionic.
> Welcome to any comment.
>
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>
> Changelog
> * from RFC v1
> * add clear comment of purge - Christoph
> * Change patch descritpion
>
> Cc: John Stultz <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Android Kernel Team <[email protected]>
> Cc: Robert Love <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Neil Brown <[email protected]>
> Cc: Mike Hommey <[email protected]>
> Cc: Taras Glek <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> include/asm-generic/mman-common.h | 3 +
> include/linux/mm.h | 9 ++-
> include/linux/mm_types.h | 5 ++
> include/linux/rmap.h | 24 ++++++-
> mm/ksm.c | 4 +-
> mm/madvise.c | 32 +++++++++-
> mm/memory.c | 2 +
> mm/migrate.c | 6 +-
> mm/rmap.c | 126 +++++++++++++++++++++++++++++++++++--
> mm/vmscan.c | 3 +
> 10 files changed, 202 insertions(+), 12 deletions(-)
>
> diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> index d030d2c..5f8090d 100644
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -34,6 +34,9 @@
> #define MADV_SEQUENTIAL 2 /* expect sequential page references */
> #define MADV_WILLNEED 3 /* will need these pages */
> #define MADV_DONTNEED 4 /* don't need these pages */
> +#define MADV_VOLATILE 5 /* pages will disappear suddenly */
> +#define MADV_NOVOLATILE 6 /* pages will not disappear */
> +
>
> /* common parameters: try to keep these consistent across architectures */
> #define MADV_REMOVE 9 /* remove these pages & resources */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 311be90..78c8c08 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
> #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
>
> +/*
> + * Recently, Konstantin removed a few flags but not merged yet
> + * so we will get a room for new flag for supporting 32 bit.
> + * Thanks, Konstantin!
> + */
> +#define VM_VOLATILE 0x100000000
> +
> /* Bits set in the VMA until the stack is in its final location */
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
>
> @@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
> * Special vmas that are non-mergable, non-mlock()able.
> * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
> */
> -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> +#define VM_SPECIAL (VM_IO|VM_DONTEXPAND|VM_RESERVED|VM_PFNMAP|VM_VOLATILE)
>
> /*
> * mapping from the currently active vm_flags protection bits (the
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index bf78672..c813daa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -279,6 +279,11 @@ struct vm_area_struct {
> #ifdef CONFIG_NUMA
> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> #endif
> + /*
> + * True if more than a page in this vma is reclaimed.
> + * It's protected by anon_vma->mutex.
> + */
> + bool purged;
> };
>
> struct core_thread {
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 3fce545..65b9f33 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -67,6 +67,10 @@ struct anon_vma_chain {
> struct list_head same_anon_vma; /* locked by anon_vma->mutex */
> };
>
> +
> +void volatile_lock(struct vm_area_struct *vma);
> +void volatile_unlock(struct vm_area_struct *vma);
> +
> #ifdef CONFIG_MMU
> static inline void get_anon_vma(struct anon_vma *anon_vma)
> {
> @@ -170,12 +174,14 @@ enum ttu_flags {
> TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> + TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
> };
> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
> int try_to_unmap(struct page *, enum ttu_flags flags);
> int try_to_unmap_one(struct page *, struct vm_area_struct *,
> - unsigned long address, enum ttu_flags flags);
> + unsigned long address, enum ttu_flags flags,
> + bool *is_volatile);
>
> /*
> * Called from mm/filemap_xip.c to unmap empty zero page
> @@ -194,6 +200,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> return ptep;
> }
>
> +pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
> + unsigned long, spinlock_t **);
> +
> +static inline pte_t *page_check_volatile_address(struct page *page,
> + struct mm_struct *mm,
> + unsigned long address,
> + spinlock_t **ptlp)
> +{
> + pte_t *ptep;
> +
> + __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
> + mm, address, ptlp));
> + return ptep;
> +}
> +
> /*
> * Used by swapoff to help locate where page is expected in vma.
> */
> @@ -257,5 +278,6 @@ static inline int page_mkclean(struct page *page)
> #define SWAP_AGAIN 1
> #define SWAP_FAIL 2
> #define SWAP_MLOCK 3
> +#define SWAP_DISCARD 4
>
> #endif /* _LINUX_RMAP_H */
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..22c54d2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1653,6 +1653,7 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
> struct rmap_item *rmap_item;
> int ret = SWAP_AGAIN;
> int search_new_forks = 0;
> + bool dummy_volatile;
>
> VM_BUG_ON(!PageKsm(page));
> VM_BUG_ON(!PageLocked(page));
> @@ -1682,7 +1683,8 @@ again:
> continue;
>
> ret = try_to_unmap_one(page, vma,
> - rmap_item->address, flags);
> + rmap_item->address, flags,
> + &dummy_volatile);
> if (ret != SWAP_AGAIN || !page_mapped(page)) {
> anon_vma_unlock(anon_vma);
> goto out;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 14d260f..53cd77f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> if (error)
> goto out;
> break;
> + case MADV_VOLATILE:
> + if (vma->vm_flags & VM_LOCKED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_VOLATILE;
> + vma->purged = false;
> + break;
> + case MADV_NOVOLATILE:
> + if (!(vma->vm_flags & VM_VOLATILE)) {
> + error = -EINVAL;
> + goto out;
> + }
> +
> + new_flags &= ~VM_VOLATILE;
> + break;
> }
>
> if (new_flags == vma->vm_flags) {
> @@ -118,9 +134,15 @@ static long madvise_behavior(struct vm_area_struct * vma,
> success:
> /*
> * vm_flags is protected by the mmap_sem held in write mode.
> + * In case of VOLATILE, we need volatile_lock, additionally.
> */
> + if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> + volatile_lock(vma);
> vma->vm_flags = new_flags;
> -
> + if (behavior == MADV_NOVOLATILE)
> + error = vma->purged;
> + if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> + volatile_unlock(vma);
> out:
> if (error == -ENOMEM)
> error = -EAGAIN;
> @@ -310,6 +332,8 @@ madvise_behavior_valid(int behavior)
> #endif
> case MADV_DONTDUMP:
> case MADV_DODUMP:
> + case MADV_VOLATILE:
> + case MADV_NOVOLATILE:
> return 1;
>
> default:
> @@ -383,7 +407,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
> if (start & ~PAGE_MASK)
> goto out;
> - len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +
> + if (behavior != MADV_VOLATILE && behavior != MADV_NOVOLATILE)
> + len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> + else
> + len = len_in & PAGE_MASK;
>
> /* Check to see whether len was rounded up from small -ve to zero */
> if (len_in && !len)
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..26b3f73 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3441,6 +3441,8 @@ int handle_pte_fault(struct mm_struct *mm,
> entry = *pte;
> if (!pte_present(entry)) {
> if (pte_none(entry)) {
> + if (unlikely(vma->vm_flags & VM_VOLATILE))
> + return VM_FAULT_SIGBUS;
> if (vma->vm_ops) {
> if (likely(vma->vm_ops->fault))
> return do_linear_fault(mm, vma, address,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 77ed2d7..d1b51af 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> }
>
> /* Establish migration ptes or remove ptes */
> - try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> + try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> + TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
> skip_unmap:
> if (!page_mapped(page))
> @@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> if (PageAnon(hpage))
> anon_vma = page_get_anon_vma(hpage);
>
> - try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> + try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> + TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
> if (!page_mapped(hpage))
> rc = move_to_new_page(new_hpage, hpage, 1, mode);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..778abfc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> return vma_address(page, vma);
> }
>
> +pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
> + unsigned long address, spinlock_t **ptlp)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + swp_entry_t entry = { .val = page_private(page) };
> +
> + if (unlikely(PageHuge(page))) {
> + pte = huge_pte_offset(mm, address);
> + ptl = &mm->page_table_lock;
> + goto check;
> + }
> +
> + pgd = pgd_offset(mm, address);
> + if (!pgd_present(*pgd))
> + return NULL;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + return NULL;
> +
> + pmd = pmd_offset(pud, address);
> + if (!pmd_present(*pmd))
> + return NULL;
> + if (pmd_trans_huge(*pmd))
> + return NULL;
> +
> + pte = pte_offset_map(pmd, address);
> + ptl = pte_lockptr(mm, pmd);
> +check:
> + spin_lock(ptl);
> + if (PageAnon(page)) {
> + if (!pte_present(*pte) && entry.val ==
> + pte_to_swp_entry(*pte).val) {
> + *ptlp = ptl;
> + return pte;
> + }
> + } else {
> + if (pte_none(*pte)) {
> + *ptlp = ptl;
> + return pte;
> + }
> + }
> + pte_unmap_unlock(pte, ptl);
> + return NULL;
> +}
> +
> /*
> * Check that @page is mapped at @address into @mm.
> *
> @@ -1218,12 +1269,42 @@ out:
> mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
>
> +int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pte_t *pte;
> + pte_t pteval;
> + spinlock_t *ptl;
> +
> + pte = page_check_volatile_address(page, mm, address, &ptl);
> + if (!pte)
> + return 0;
> +
> + /* Nuke the page table entry. */
> + flush_cache_page(vma, address, page_to_pfn(page));
> + pteval = ptep_clear_flush(vma, address, pte);
> +
> + if (PageAnon(page)) {
> + swp_entry_t entry = { .val = page_private(page) };
> + if (PageSwapCache(page)) {
> + dec_mm_counter(mm, MM_SWAPENTS);
> + swap_free(entry);
> + }
> + }
> +
> + pte_unmap_unlock(pte, ptl);
> + mmu_notifier_invalidate_page(mm, address);
> + return 1;
> +}
> +
> /*
> * Subfunctions of try_to_unmap: try_to_unmap_one called
> * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
> */
> int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> - unsigned long address, enum ttu_flags flags)
> + unsigned long address, enum ttu_flags flags,
> + bool *is_volatile)
> {
> struct mm_struct *mm = vma->vm_mm;
> pte_t *pte;
> @@ -1235,6 +1316,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> if (!pte)
> goto out;
>
> + if (!(vma->vm_flags & VM_VOLATILE))
> + *is_volatile = false;
> /*
> * If the page is mlock()d, we cannot swap it out.
> * If it's recently referenced (perhaps page_referenced
> @@ -1494,6 +1577,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> struct anon_vma *anon_vma;
> struct anon_vma_chain *avc;
> int ret = SWAP_AGAIN;
> + bool is_volatile = true;
> +
> + if (flags & TTU_IGNORE_VOLATILE)
> + is_volatile = false;
>
> anon_vma = page_lock_anon_vma(page);
> if (!anon_vma)
> @@ -1512,17 +1599,32 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> * temporary VMAs until after exec() completes.
> */
> if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> - is_vma_temporary_stack(vma))
> + is_vma_temporary_stack(vma)) {
> + is_volatile = false;
> continue;
> + }
>
> address = vma_address(page, vma);
> if (address == -EFAULT)
> continue;
> - ret = try_to_unmap_one(page, vma, address, flags);
> + ret = try_to_unmap_one(page, vma, address, flags, &is_volatile);
> if (ret != SWAP_AGAIN || !page_mapped(page))
> break;
> }
>
> + if (page_mapped(page) || is_volatile == false)
> + goto out;
> +
> + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> + struct vm_area_struct *vma = avc->vma;
> + unsigned long address;
> +
> + address = vma_address(page, vma);
> + if (try_to_zap_one(page, vma, address))
> + vma->purged = true;
> + }
> + ret = SWAP_DISCARD;
> +out:
> page_unlock_anon_vma(anon_vma);
> return ret;
> }
> @@ -1553,13 +1655,14 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
> unsigned long max_nl_cursor = 0;
> unsigned long max_nl_size = 0;
> unsigned int mapcount;
> + bool dummy;
>
> mutex_lock(&mapping->i_mmap_mutex);
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> unsigned long address = vma_address(page, vma);
> if (address == -EFAULT)
> continue;
> - ret = try_to_unmap_one(page, vma, address, flags);
> + ret = try_to_unmap_one(page, vma, address, flags, &dummy);
> if (ret != SWAP_AGAIN || !page_mapped(page))
> goto out;
> }
> @@ -1651,6 +1754,7 @@ out:
> * SWAP_AGAIN - we missed a mapping, try again later
> * SWAP_FAIL - the page is unswappable
> * SWAP_MLOCK - page is mlocked.
> + * SWAP_DISCARD - page is volatile.
> */
> int try_to_unmap(struct page *page, enum ttu_flags flags)
> {
> @@ -1665,7 +1769,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> ret = try_to_unmap_anon(page, flags);
> else
> ret = try_to_unmap_file(page, flags);
> - if (ret != SWAP_MLOCK && !page_mapped(page))
> + if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
> ret = SWAP_SUCCESS;
> return ret;
> }
> @@ -1707,6 +1811,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
> anon_vma_free(anon_vma);
> }
>
> +void volatile_lock(struct vm_area_struct *vma)
> +{
> + if (vma->anon_vma)
> + anon_vma_lock(vma->anon_vma);
> +}
> +
> +void volatile_unlock(struct vm_area_struct *vma)
> +{
> + if (vma->anon_vma)
> + anon_vma_unlock(vma->anon_vma);
> +}
> +
> #ifdef CONFIG_MIGRATION
> /*
> * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99b434b..d5b60d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -789,6 +789,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> if (page_mapped(page) && mapping) {
> switch (try_to_unmap(page, TTU_UNMAP)) {
> + case SWAP_DISCARD:
> + goto discard_page;
> case SWAP_FAIL:
> goto activate_locked;
> case SWAP_AGAIN:
> @@ -857,6 +859,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
> }
>
> +discard_page:
> /*
> * If the page has buffers, try to free the buffer mappings
> * associated with this page. If we succeed we try to free
> --
> 1.7.9.5
>

--
Thanks,
--Bob

2012-11-02 02:31:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

Hi Bob,

On Fri, Nov 02, 2012 at 09:43:01AM +0800, Bob Liu wrote:
> On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim <[email protected]> wrote:
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
> >
>
> I have a question.
> Pte entries are cleaned up during munmap(), so if user space tries to
> access the unmapped address, a page fault will be generated.
>
> What happens if MADV_VOLATILE is used instead? What's the result?

1) If the user accesses pages of a volatile vma which have already been
discarded by the VM due to memory pressure, the access gets SIGBUS.

2) If the user accesses pages of a volatile vma which are still live in
memory (not yet discarded), the access works normally.

3) If the user accesses pages of a volatile vma which were swapped out
before calling madvise(MADV_VOLATILE), that is okay, too.

> The pte entries are not cleaned, so user space can still access the
> memory until the VM discards the pages?

Yeb.
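
To make the flow above concrete, here is a minimal user-space sketch (not
part of the patch) of how an allocator might drive the proposed interface.
The MADV_VOLATILE/MADV_NOVOLATILE values match the mman-common.h hunk in
the patch; the assumption that madvise(MADV_NOVOLATILE) reports a purge
through a non-zero return value follows the vma->purged handling in
madvise_behavior(). The function names are purely illustrative.

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE	5	/* proposed: pages may disappear suddenly */
#define MADV_NOVOLATILE	6	/* proposed: pages will not disappear */
#endif

/* free(3) path: mark the page-aligned chunk volatile instead of munmap() */
static void chunk_release(void *addr, size_t len)
{
	madvise(addr, len, MADV_VOLATILE);
}

/* malloc(3) path: make the chunk stable again before handing it out */
static int chunk_reuse(void *addr, size_t len)
{
	/*
	 * A non-zero return is assumed to mean the range was purged while
	 * volatile: the old contents are gone and the range faults in again
	 * as zero-filled pages, so the caller must not rely on its data.
	 */
	return madvise(addr, len, MADV_NOVOLATILE);
}

The point of the pattern is that the address range stays mapped across
free/malloc, so in the common (unpurged) case reuse costs only the two
madvise() calls, with no pte teardown and no zero-fill faults.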


--
Kind regards,
Minchan Kim

2012-11-06 00:14:52

by Arun Sharma

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> glibc malloc discards freed memory by using MADV_DONTNEED, as tcmalloc
> does, and this is often the source of a large performance decrease:
> MADV_DONTNEED discards the memory immediately, and the very next
> malloc() call falls into the page fault and page-sized memset() path.
> So using DONTNEED increases the zero-fill and cache miss rate.

The memcg-based solution that I posted a few months ago is working well
for us. We see significantly less CPU time spent zeroing pages.

Not everyone was comfortable with the security implications of recycling
pages between processes in a memcg, although it was disabled by default
and required an explicit opt-in.

Also, memory allocators have a second motivation for using madvise: to
create virtually contiguous regions of memory from a fragmented address
space, without increasing the RSS.

-Arun

2012-11-06 01:49:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

Hello,

On Mon, Nov 05, 2012 at 03:54:43PM -0800, Arun Sharma wrote:
> On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> > glibc malloc discards freed memory by using MADV_DONTNEED, as tcmalloc
> > does, and this is often the source of a large performance decrease:
> > MADV_DONTNEED discards the memory immediately, and the very next
> > malloc() call falls into the page fault and page-sized memset() path.
> > So using DONTNEED increases the zero-fill and cache miss rate.
>
> The memcg-based solution that I posted a few months ago is working well
> for us. We see significantly less CPU time spent zeroing pages.
>
> Not everyone was comfortable with the security implications of recycling
> pages between processes in a memcg, although it was disabled by default
> and required an explicit opt-in.
>
> Also, memory allocators have a second motivation for using madvise: to
> create virtually contiguous regions of memory from a fragmented address
> space, without increasing the RSS.

I don't get it. How do we create a contiguous region with madvise?
Just out of curiosity, could you elaborate on that use case? :)

>
> -Arun
>

--
Kind Regards,
Minchan Kim

2012-11-06 02:38:35

by Arun Sharma

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

On 11/5/12 5:49 PM, Minchan Kim wrote:

>> Also, memory allocators have a second motivation for using madvise: to
>> create virtually contiguous regions of memory from a fragmented address
>> space, without increasing the RSS.
>
> I don't get it. How do we create a contiguous region with madvise?
> Just out of curiosity, could you elaborate on that use case? :)

By using a new anonymous map and faulting pages in.

The fragmented virtual memory is released via MADV_DONTNEED, and if the
malloc/free activity on the system is dominated by one process, chances
are that the newly faulted-in page is the one released by the same
process :)

The net effect is that physical pages within a single address space are
rearranged so that larger allocations can be satisfied.
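
To illustrate that pattern, here is a rough sketch; the function and
variable names are illustrative and not taken from any real allocator.
Live objects are copied into a fresh anonymous mapping, and the old,
fragmented arena is released with MADV_DONTNEED so its physical pages can
back the new mapping's faults.

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static void *compact_arena(void *old_arena, size_t arena_size,
			   void **objs, const size_t *obj_sizes, size_t nr_objs)
{
	char *new_arena, *cursor;
	size_t i;

	new_arena = mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (new_arena == MAP_FAILED)
		return NULL;

	/* Fault pages in by packing the live objects contiguously. */
	cursor = new_arena;
	for (i = 0; i < nr_objs; i++) {
		memcpy(cursor, objs[i], obj_sizes[i]);
		objs[i] = cursor;
		cursor += obj_sizes[i];
	}

	/*
	 * Drop the old arena's pages.  The virtual range stays reserved,
	 * but only the newly packed live data remains resident, so RSS
	 * does not grow while the tail of the new arena becomes one large
	 * contiguous free region.
	 */
	madvise(old_arena, arena_size, MADV_DONTNEED);
	return new_arena;
}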

-Arun

2012-11-22 22:01:22

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

On 10/29/2012 06:29 PM, Minchan Kim wrote:
> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
>
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.
>
> Volatile conecept of Robert Love could be very useful for reducing
> free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> munmap(2)(Of course, they need to manage volatile mmaped area to
> reduce shortage of address space and sometime ends up unmaping them).
> The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because
>
> 1) it just marks the flag in VMA and
> 2) if memory pressure happens, VM can discard pages of volatile VMA
> instead of swapping out when volatile pages is selected as victim
> by normal VM aging policy.
> 3) freed mmaped area doesn't include any meaningful data so there
> is no point to swap them out.
>
> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.
>
> The downside is that we have to age anon lru list although we don't
> have swap because I don't want to discard volatile pages by top priority
> when memory pressure happens as volatile in this patch means "We don't
> need to swap out because user can handle the situation which data are
> disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> So I want to apply same aging rule of nomal pages to them.
>
> Anon background aging of non-swap system would be a trade-off for
> getting good feature. Even, we had done it two years ago until merge
> [1] and I believe free(3) performance gain will beat loss of anon lru
> aging's overead once all of allocator start to use madvise.
> (This patch doesn't include background aging in case of non-swap system
> but it's trivial if we decide)

Hey Minchan!
So I've been looking at your patch for a bit, and I'm still trying
to fully grok it and the rmap code. Overall this approach looks pretty
interesting. While your patch description focused on malloc/free
behavior, I suspect your patch would satisfy what the Mozilla folks are
looking for, and while it's not quite sufficient yet for Android, the
interface semantics are very close to what I've been wanting (my test
cases were easily mapped over).

The two major issues for me are:
1) As you noted, this approach currently doesn't work on non-swap
systems, since we don't try to shrink the anonymous page lrus there. This
is a big problem, as it makes the feature unusable for almost all Android
systems. You suggest we may want to change the anonymous lru aging, and I
had an earlier patch that tried to change some of the anonymous lru aging
rules for volatile pages, but it's not quite right for what you have
here. So I'd be interested in hearing how you think anonymous lru aging
should happen with swap off.

2) Being able to use this with tmpfs files. I'm currently trying to
better understand the rmap code, looking to see if there's a way to have
try_to_unmap_file() work similarly to try_to_unmap_anon(), to allow
users to madvise() on mmapped tmpfs files. This would provide a very
similar interface to what I've been proposing with fadvise/fallocate,
but using process virtual addresses instead of (fd, offset) pairs. The
benefit of (fd, offset) pairs for Android is that it's easier to manage
shared volatile ranges between two processes that are sharing data via
an mmapped tmpfs file (although this actual use case may be fairly
rare). I believe we should still be able to rework the ashmem internals
to use madvise (which would provide legacy support for existing Android
apps), so then it's just a question of whether we can eventually
convince Android apps to use the madvise interface directly, rather
than the ashmem unpin ioctl.
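
As a reference point, here is a minimal sketch of what that madvise-based
usage could look like on a shared tmpfs mapping. This is the interface
John's tree is aiming for rather than something the posted patch already
supports; the file path and function name are illustrative, the length is
assumed page-aligned, and the MADV_VOLATILE value is the one proposed in
this RFC.

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE 5		/* value proposed in this RFC */
#endif

/* Map a tmpfs-backed cache file and mark the whole mapping volatile. */
static void *map_and_mark_volatile(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);	/* e.g. a file under /dev/shm */
	void *p;

	if (fd < 0)
		return MAP_FAILED;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return MAP_FAILED;

	/* The (fd, offset) pair becomes a plain virtual address range. */
	if (madvise(p, len, MADV_VOLATILE)) {
		munmap(p, len);
		return MAP_FAILED;
	}

	/* Caller calls madvise(p, len, MADV_NOVOLATILE) before trusting data. */
	return p;
}

Since the volatile state is a per-vma flag in this patch, each process
mapping the same file would manage its own range independently, which is
the coordination point John contrasts with the (fd, offset) interface.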

The other concern with the madvise-on-mmapped-files approach is that
there's no easy way I can see to limit it to tmpfs files. I know some
have been interested in having a fallocate(VOLATILE) interface for
non-tmpfs files, but I'm not sure I see the benefit there yet. I have
noticed folks mixing the idea of volatile pages, which are purged under
memory pressure, with the idea of volatile files, which might be purged
from disk under disk pressure. While I think the second idea is
interesting, I do think it's completely separate from the volatile
memory concept.

Anyway, I'd be interested in your thoughts on these two issues. Thanks
so much for sending out this patch; it's given me quite a bit to chew on,
and I too hope we can merge our different approaches together.

thanks
-john

2012-11-29 04:18:14

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v2] Support volatile range for anon vma

On 11/21/2012 04:36 PM, John Stultz wrote:
> 2) Being able to use this with tmpfs files. I'm currently trying to
> better understand the rmap code, looking to see if there's a way to
> have try_to_unmap_file() work similarly to try_to_unmap_anon(), to
> allow users to madvise() on mmapped tmpfs files. This would provide a
> very similar interface to what I've been proposing with
> fadvise/fallocate, but using process virtual addresses instead of
> (fd, offset) pairs. The benefit of (fd, offset) pairs for Android
> is that it's easier to manage shared volatile ranges between two
> processes that are sharing data via an mmapped tmpfs file (although
> this actual use case may be fairly rare). I believe we should still
> be able to rework the ashmem internals to use madvise (which would
> provide legacy support for existing Android apps), so then it's just a
> question of whether we can eventually convince Android apps to use
> the madvise interface directly, rather than the ashmem unpin ioctl.

Hey Minchan,
I've been playing around with your patch, trying to better
understand your approach and to extend it to support tmpfs files. In
doing so I've found a few bugs and have some rough fixes I wanted to
share. There are still a few edge cases I need to deal with (the
vma->purged flag isn't being properly handled through vma merge/split
operations), but it's starting to come along.

Anyway, take a look at the tree here and let me know what you think.
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

I'm sure much is wrong with the tree, but with it I can now mark tmpfs
file pages as volatile/nonvolatile and see them purged under pressure.
Unfortunately it's not limited to tmpfs, so persistent files will also
work, but the state of the underlying files on purge is undefined.
Hopefully I can find a way to limit it to non-persistent filesystems for
now, and if needed find a way to extend it to persistent filesystems in
a sane way later.

thanks
-john