Date: Fri, 2 Nov 2012 09:43:01 +0800
Subject: Re: [RFC v2] Support volatile range for anon vma
From: Bob Liu
To: Minchan Kim
Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	John Stultz, Christoph Lameter, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim wrote:
> This patch introduces the new madvise behaviors MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It differs from John Stultz's
> version, which covers only tmpfs, while this patch covers only
> anonymous pages, so it cannot replace John's work. If the idea below
> proves reasonable, I hope we can unify both concepts via
> madvise/fadvise.
>
> The rationale is as follows.
> Many allocators call munmap(2) when the user calls free(3) on a
> pointer in an mmaped area. But munmap isn't cheap: it has to clean
> up all the pte entries and unlink the vma, so the overhead grows
> linearly with the size of the mmaped area.
>

I have a question. Pte entries are cleaned up during munmap(), so if
user space tries to access an unmapped address, a page fault is
generated. What happens with MADV_VOLATILE instead? Since the pte
entries are not cleaned up, can user space still access the memory
until the VM discards the pages?

> The volatile concept from Robert Love could be very useful for
> reducing free(3) overhead. Allocators can call madvise(MADV_VOLATILE)
> instead of munmap(2). (Of course, they need to manage the volatile
> mmaped areas to limit address-space shortage, and sometimes they end
> up unmapping them.) madvise(MADV_VOLATILE|NOVOLATILE) is a very cheap
> operation because:
>
> 1) it just marks a flag in the VMA, and
> 2) under memory pressure, the VM can discard the pages of a volatile
>    VMA instead of swapping them out when volatile pages are selected
>    as victims by the normal VM aging policy.
> 3) a freed mmaped area doesn't contain any meaningful data, so there
>    is no point in swapping it out.
>
> The allocator should call madvise(MADV_NOVOLATILE) before handing
> the area back out to the user. Otherwise, accesses to the volatile
> range will hit a SIGBUS error.
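To make the intended lifecycle concrete, here is a minimal user-space
sketch of the allocator pattern described above. It assumes a kernel
carrying this patch and defines the RFC's madvise values directly,
since they are not in any released headers; whether a purge actually
happens depends on memory pressure and the VM's aging decisions.

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE   5       /* values from this RFC, not glibc */
#define MADV_NOVOLATILE 6
#endif

int main(void)
{
        size_t len = 1 << 20;   /* a 1MB arena the allocator has freed */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xaa, len);

        /* free(3) path: mark the range volatile instead of munmap(2).
         * The mapping stays; the VM may now discard these pages under
         * memory pressure instead of swapping them out. */
        if (madvise(p, len, MADV_VOLATILE))
                return 1;

        /* Touching p here may SIGBUS if the pages were already purged. */

        /* malloc(3) path: revive the range before reusing it. Per the
         * patch (error = vma->purged), a return of 1 appears to mean
         * "contents were discarded", so treat the memory as
         * uninitialized in that case. */
        int purged = madvise(p, len, MADV_NOVOLATILE);
        if (purged)
                memset(p, 0, len);

        munmap(p, len);
        return 0;
}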
> The downside is that we have to age the anon LRU list even when we
> have no swap, because I don't want volatile pages discarded with top
> priority when memory pressure happens. Volatile in this patch means
> "we don't need to swap these out because the user can handle data
> disappearing suddenly", NOT "they are useless, so hurry up and
> reclaim them". So I want to apply the same aging rule as for normal
> pages.
>
> Anon background aging on a swapless system would be a trade-off for
> getting this feature. We had even done it that way for two years,
> until [1] was merged, and I believe the free(3) performance gain
> will beat the overhead of anon LRU aging once allocators start to
> use madvise. (This patch doesn't include background aging for
> swapless systems, but it's trivial to add if we decide to.)
>
> I hope to see opinions from others before diving into glibc or
> bionic. Any comments are welcome.
>
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no
>     swap system
>
> Changelog
> * from RFC v1
>   * add clear comment on purge - Christoph
>   * change patch description
>
> Cc: John Stultz
> Cc: Andrew Morton
> Cc: Christoph Lameter
> Cc: Android Kernel Team
> Cc: Robert Love
> Cc: Mel Gorman
> Cc: Hugh Dickins
> Cc: Dave Hansen
> Cc: Rik van Riel
> Cc: Dave Chinner
> Cc: Neil Brown
> Cc: Mike Hommey
> Cc: Taras Glek
> Cc: KOSAKI Motohiro
> Cc: KAMEZAWA Hiroyuki
> Signed-off-by: Minchan Kim
> ---
>  include/asm-generic/mman-common.h |    3 +
>  include/linux/mm.h                |    9 ++-
>  include/linux/mm_types.h          |    5 ++
>  include/linux/rmap.h              |   24 ++++++-
>  mm/ksm.c                          |    4 +-
>  mm/madvise.c                      |   32 +++++++++-
>  mm/memory.c                       |    2 +
>  mm/migrate.c                      |    6 +-
>  mm/rmap.c                         |  126 +++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                       |    3 +
>  10 files changed, 202 insertions(+), 12 deletions(-)
>
> diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> index d030d2c..5f8090d 100644
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -34,6 +34,9 @@
>  #define MADV_SEQUENTIAL 2               /* expect sequential page references */
>  #define MADV_WILLNEED   3               /* will need these pages */
>  #define MADV_DONTNEED   4               /* don't need these pages */
> +#define MADV_VOLATILE   5               /* pages will disappear suddenly */
> +#define MADV_NOVOLATILE 6               /* pages will not disappear */
> +
>
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_REMOVE     9               /* remove these pages & resources */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 311be90..78c8c08 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFN_AT_MMAP  0x40000000      /* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE    0x80000000      /* KSM may merge identical pages */
>
> +/*
> + * Recently, Konstantin removed a few flags, but they are not merged
> + * yet, so we will get room for a new flag to support 32 bit.
> + * Thanks, Konstantin!
> + */
> +#define VM_VOLATILE     0x100000000
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP       (VM_RAND_READ | VM_SEQ_READ)
>
> @@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
>   * Special vmas that are non-mergable, non-mlock()able.
>   * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
>   */
> -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> +#define VM_SPECIAL (VM_IO|VM_DONTEXPAND|VM_RESERVED|VM_PFNMAP|VM_VOLATILE)
>
>  /*
>   * mapping from the currently active vm_flags protection bits (the
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index bf78672..c813daa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -279,6 +279,11 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>          struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
>  #endif
> +        /*
> +         * True if one or more pages of this vma have been reclaimed.
> +         * It's protected by anon_vma->mutex.
> +         */
> +        bool purged;
>  };
>
>  struct core_thread {
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 3fce545..65b9f33 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -67,6 +67,10 @@ struct anon_vma_chain {
>          struct list_head same_anon_vma; /* locked by anon_vma->mutex */
>  };
>
> +
> +void volatile_lock(struct vm_area_struct *vma);
> +void volatile_unlock(struct vm_area_struct *vma);
> +
>  #ifdef CONFIG_MMU
>  static inline void get_anon_vma(struct anon_vma *anon_vma)
>  {
> @@ -170,12 +174,14 @@ enum ttu_flags {
>          TTU_IGNORE_MLOCK = (1 << 8),    /* ignore mlock */
>          TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
>          TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> +        TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
>  };
>  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
>  int try_to_unmap(struct page *, enum ttu_flags flags);
>  int try_to_unmap_one(struct page *, struct vm_area_struct *,
> -                        unsigned long address, enum ttu_flags flags);
> +                        unsigned long address, enum ttu_flags flags,
> +                        bool *is_volatile);
>
>  /*
>   * Called from mm/filemap_xip.c to unmap empty zero page
> @@ -194,6 +200,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>          return ptep;
>  }
>
> +pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
> +                                unsigned long, spinlock_t **);
> +
> +static inline pte_t *page_check_volatile_address(struct page *page,
> +                                        struct mm_struct *mm,
> +                                        unsigned long address,
> +                                        spinlock_t **ptlp)
> +{
> +        pte_t *ptep;
> +
> +        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
> +                                mm, address, ptlp));
> +        return ptep;
> +}
> +
>  /*
>   * Used by swapoff to help locate where page is expected in vma.
>   */
> @@ -257,5 +278,6 @@ static inline int page_mkclean(struct page *page)
>  #define SWAP_AGAIN      1
>  #define SWAP_FAIL       2
>  #define SWAP_MLOCK      3
> +#define SWAP_DISCARD    4
>
>  #endif  /* _LINUX_RMAP_H */
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..22c54d2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1653,6 +1653,7 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
>          struct rmap_item *rmap_item;
>          int ret = SWAP_AGAIN;
>          int search_new_forks = 0;
> +        bool dummy_volatile;
>
>          VM_BUG_ON(!PageKsm(page));
>          VM_BUG_ON(!PageLocked(page));
> @@ -1682,7 +1683,8 @@ again:
>                          continue;
>
>                  ret = try_to_unmap_one(page, vma,
> -                                rmap_item->address, flags);
> +                                rmap_item->address, flags,
> +                                &dummy_volatile);
>                  if (ret != SWAP_AGAIN || !page_mapped(page)) {
>                          anon_vma_unlock(anon_vma);
>                          goto out;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 14d260f..53cd77f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
>                  if (error)
>                          goto out;
>                  break;
> +        case MADV_VOLATILE:
> +                if (vma->vm_flags & VM_LOCKED) {
> +                        error = -EINVAL;
> +                        goto out;
> +                }
> +                new_flags |= VM_VOLATILE;
> +                vma->purged = false;
> +                break;
> +        case MADV_NOVOLATILE:
> +                if (!(vma->vm_flags & VM_VOLATILE)) {
> +                        error = -EINVAL;
> +                        goto out;
> +                }
> +
> +                new_flags &= ~VM_VOLATILE;
> +                break;
>          }
>
>          if (new_flags == vma->vm_flags) {
> @@ -118,9 +134,15 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  success:
>          /*
>           * vm_flags is protected by the mmap_sem held in write mode.
> +         * In the VOLATILE case we additionally need the volatile_lock.
>           */
> +        if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> +                volatile_lock(vma);
>          vma->vm_flags = new_flags;
> -
> +        if (behavior == MADV_NOVOLATILE)
> +                error = vma->purged;
> +        if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> +                volatile_unlock(vma);
>  out:
>          if (error == -ENOMEM)
>                  error = -EAGAIN;
> @@ -310,6 +332,8 @@ madvise_behavior_valid(int behavior)
>  #endif
>          case MADV_DONTDUMP:
>          case MADV_DODUMP:
> +        case MADV_VOLATILE:
> +        case MADV_NOVOLATILE:
>                  return 1;
>
>          default:
> @@ -383,7 +407,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
>          if (start & ~PAGE_MASK)
>                  goto out;
> -        len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +
> +        if (behavior != MADV_VOLATILE && behavior != MADV_NOVOLATILE)
> +                len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +        else
> +                len = len_in & PAGE_MASK;
>
>          /* Check to see whether len was rounded up from small -ve to zero */
>          if (len_in && !len)
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..26b3f73 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3441,6 +3441,8 @@ int handle_pte_fault(struct mm_struct *mm,
>          entry = *pte;
>          if (!pte_present(entry)) {
>                  if (pte_none(entry)) {
> +                        if (unlikely(vma->vm_flags & VM_VOLATILE))
> +                                return VM_FAULT_SIGBUS;
>                          if (vma->vm_ops) {
>                                  if (likely(vma->vm_ops->fault))
>                                          return do_linear_fault(mm, vma, address,
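This handle_pte_fault() hunk is what turns an access to a purged page
into SIGBUS. As a hedged sketch of how user space could probe a
volatile range without crashing, assuming a kernel carrying this
patch: the probe_volatile() helper below is hypothetical (my own
naming) and would pair with the lifecycle sketch earlier.

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;

static void sigbus_handler(int sig)
{
        (void)sig;
        siglongjmp(probe_env, 1);       /* the touched page was purged */
}

/* Returns 1 if the byte at addr is still resident, 0 if purged. */
static int probe_volatile(const volatile char *addr)
{
        struct sigaction sa, old;
        int resident = 1;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigbus_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        if (sigsetjmp(probe_env, 1))
                resident = 0;   /* arrived here via siglongjmp from SIGBUS */
        else
                (void)*addr;    /* may fault if the VM discarded the page */

        sigaction(SIGBUS, &old, NULL);
        return resident;
}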
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 77ed2d7..d1b51af 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>          }
>
>          /* Establish migration ptes or remove ptes */
> -        try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +        try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +                        TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
>  skip_unmap:
>          if (!page_mapped(page))
> @@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>          if (PageAnon(hpage))
>                  anon_vma = page_get_anon_vma(hpage);
>
> -        try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +        try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +                        TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
>          if (!page_mapped(hpage))
>                  rc = move_to_new_page(new_hpage, hpage, 1, mode);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..778abfc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>          return vma_address(page, vma);
>  }
>
> +pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
> +                        unsigned long address, spinlock_t **ptlp)
> +{
> +        pgd_t *pgd;
> +        pud_t *pud;
> +        pmd_t *pmd;
> +        pte_t *pte;
> +        spinlock_t *ptl;
> +
> +        swp_entry_t entry = { .val = page_private(page) };
> +
> +        if (unlikely(PageHuge(page))) {
> +                pte = huge_pte_offset(mm, address);
> +                ptl = &mm->page_table_lock;
> +                goto check;
> +        }
> +
> +        pgd = pgd_offset(mm, address);
> +        if (!pgd_present(*pgd))
> +                return NULL;
> +
> +        pud = pud_offset(pgd, address);
> +        if (!pud_present(*pud))
> +                return NULL;
> +
> +        pmd = pmd_offset(pud, address);
> +        if (!pmd_present(*pmd))
> +                return NULL;
> +        if (pmd_trans_huge(*pmd))
> +                return NULL;
> +
> +        pte = pte_offset_map(pmd, address);
> +        ptl = pte_lockptr(mm, pmd);
> +check:
> +        spin_lock(ptl);
> +        if (PageAnon(page)) {
> +                if (!pte_present(*pte) && entry.val ==
> +                                pte_to_swp_entry(*pte).val) {
> +                        *ptlp = ptl;
> +                        return pte;
> +                }
> +        } else {
> +                if (pte_none(*pte)) {
> +                        *ptlp = ptl;
> +                        return pte;
> +                }
> +        }
> +
> +        pte_unmap_unlock(pte, ptl);
> +        return NULL;
> +}
> +
>  /*
>   * Check that @page is mapped at @address into @mm.
>   *
> @@ -1218,12 +1269,42 @@ out:
>          mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>
> +int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
> +                unsigned long address)
> +{
> +        struct mm_struct *mm = vma->vm_mm;
> +        pte_t *pte;
> +        pte_t pteval;
> +        spinlock_t *ptl;
> +
> +        pte = page_check_volatile_address(page, mm, address, &ptl);
> +        if (!pte)
> +                return 0;
> +
> +        /* Nuke the page table entry. */
> +        flush_cache_page(vma, address, page_to_pfn(page));
> +        pteval = ptep_clear_flush(vma, address, pte);
> +
> +        if (PageAnon(page)) {
> +                swp_entry_t entry = { .val = page_private(page) };
> +                if (PageSwapCache(page)) {
> +                        dec_mm_counter(mm, MM_SWAPENTS);
> +                        swap_free(entry);
> +                }
> +        }
> +
> +        pte_unmap_unlock(pte, ptl);
> +        mmu_notifier_invalidate_page(mm, address);
> +        return 1;
> +}
> +
>  /*
>   * Subfunctions of try_to_unmap: try_to_unmap_one called
>   * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
>   */
>  int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> -                     unsigned long address, enum ttu_flags flags)
> +                     unsigned long address, enum ttu_flags flags,
> +                     bool *is_volatile)
>  {
>          struct mm_struct *mm = vma->vm_mm;
>          pte_t *pte;
> @@ -1235,6 +1316,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>          if (!pte)
>                  goto out;
>
> +        if (!(vma->vm_flags & VM_VOLATILE))
> +                *is_volatile = false;
>          /*
>           * If the page is mlock()d, we cannot swap it out.
>           * If it's recently referenced (perhaps page_referenced
> @@ -1494,6 +1577,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>          struct anon_vma *anon_vma;
>          struct anon_vma_chain *avc;
>          int ret = SWAP_AGAIN;
> +        bool is_volatile = true;
> +
> +        if (flags & TTU_IGNORE_VOLATILE)
> +                is_volatile = false;
>
>          anon_vma = page_lock_anon_vma(page);
>          if (!anon_vma)
> @@ -1512,17 +1599,32 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>                   * temporary VMAs until after exec() completes.
>                   */
>                  if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> -                                is_vma_temporary_stack(vma))
> +                                is_vma_temporary_stack(vma)) {
> +                        is_volatile = false;
>                          continue;
> +                }
>
>                  address = vma_address(page, vma);
>                  if (address == -EFAULT)
>                          continue;
> -                ret = try_to_unmap_one(page, vma, address, flags);
> +                ret = try_to_unmap_one(page, vma, address, flags, &is_volatile);
>                  if (ret != SWAP_AGAIN || !page_mapped(page))
>                          break;
>          }
>
> +        if (page_mapped(page) || is_volatile == false)
> +                goto out;
> +
> +        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +                struct vm_area_struct *vma = avc->vma;
> +                unsigned long address;
> +
> +                address = vma_address(page, vma);
> +                if (try_to_zap_one(page, vma, address))
> +                        vma->purged = true;
> +        }
> +        ret = SWAP_DISCARD;
> +out:
>          page_unlock_anon_vma(anon_vma);
>          return ret;
>  }
> @@ -1553,13 +1655,14 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
>          unsigned long max_nl_cursor = 0;
>          unsigned long max_nl_size = 0;
>          unsigned int mapcount;
> +        bool dummy;
>
>          mutex_lock(&mapping->i_mmap_mutex);
>          vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>                  unsigned long address = vma_address(page, vma);
>                  if (address == -EFAULT)
>                          continue;
> -                ret = try_to_unmap_one(page, vma, address, flags);
> +                ret = try_to_unmap_one(page, vma, address, flags, &dummy);
>                  if (ret != SWAP_AGAIN || !page_mapped(page))
>                          goto out;
>          }
> @@ -1651,6 +1754,7 @@ out:
>   * SWAP_AGAIN   - we missed a mapping, try again later
>   * SWAP_FAIL    - the page is unswappable
>   * SWAP_MLOCK   - page is mlocked.
> + * SWAP_DISCARD - page is volatile.
>   */
>  int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
> @@ -1665,7 +1769,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
>                  ret = try_to_unmap_anon(page, flags);
>          else
>                  ret = try_to_unmap_file(page, flags);
> -        if (ret != SWAP_MLOCK && !page_mapped(page))
> +        if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
>                  ret = SWAP_SUCCESS;
>          return ret;
>  }
> @@ -1707,6 +1811,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
>          anon_vma_free(anon_vma);
>  }
>
> +void volatile_lock(struct vm_area_struct *vma)
> +{
> +        if (vma->anon_vma)
> +                anon_vma_lock(vma->anon_vma);
> +}
> +
> +void volatile_unlock(struct vm_area_struct *vma)
> +{
> +        if (vma->anon_vma)
> +                anon_vma_unlock(vma->anon_vma);
> +}
> +
>  #ifdef CONFIG_MIGRATION
>  /*
>   * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99b434b..d5b60d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -789,6 +789,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>           */
>          if (page_mapped(page) && mapping) {
>                  switch (try_to_unmap(page, TTU_UNMAP)) {
> +                case SWAP_DISCARD:
> +                        goto discard_page;
>                  case SWAP_FAIL:
>                          goto activate_locked;
>                  case SWAP_AGAIN:
> @@ -857,6 +859,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          }
>                  }
>
> +discard_page:
>                  /*
>                   * If the page has buffers, try to free the buffer mappings
>                   * associated with this page. If we succeed we try to free
> --
> 1.7.9.5
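Since the stated next step is glibc or bionic integration, here is a
rough, single-threaded user-space sketch of how an allocator might
track volatile ranges. All names here are hypothetical, the MADV_*
values come from this RFC rather than released headers, and a real
allocator would need locking and a policy for unmapping ranges when
address space runs short.

#include <stdlib.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE   5       /* values from this RFC, not glibc */
#define MADV_NOVOLATILE 6
#endif

struct vrange {
        void *addr;
        size_t len;
        struct vrange *next;
};

static struct vrange *vlist;    /* volatile ranges awaiting reuse */

/* free(3) path: an O(1) flag set instead of a linear pte teardown */
static void arena_release(void *addr, size_t len)
{
        struct vrange *v = malloc(sizeof(*v));

        if (!v || madvise(addr, len, MADV_VOLATILE)) {
                free(v);
                munmap(addr, len);      /* fall back to the old behavior */
                return;
        }
        v->addr = addr;
        v->len = len;
        v->next = vlist;
        vlist = v;
}

/* malloc(3) path: revive a cached range; purged contents are gone */
static void *arena_acquire(size_t len)
{
        struct vrange **pv, *v;

        for (pv = &vlist; (v = *pv); pv = &v->next) {
                if (v->len < len)
                        continue;
                *pv = v->next;
                void *addr = v->addr;
                /* Per the patch, a nonzero return appears to mean the
                 * range was purged; either way the caller must treat
                 * the memory as uninitialized. */
                madvise(addr, v->len, MADV_NOVOLATILE);
                free(v);
                return addr;
        }
        return NULL;    /* caller falls back to a fresh mmap() */
}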
Thanks,
--Bob