Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755447AbaGHAgV (ORCPT ); Mon, 7 Jul 2014 20:36:21 -0400 Received: from lgeamrelo04.lge.com ([156.147.1.127]:55724 "EHLO lgeamrelo04.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753965AbaGHAgR (ORCPT ); Mon, 7 Jul 2014 20:36:17 -0400 X-Original-SENDERIP: 10.177.220.169 X-Original-MAILFROM: minchan@kernel.org Date: Tue, 8 Jul 2014 09:36:17 +0900 From: Minchan Kim To: "Kirill A. Shutemov" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Kerrisk , Linux API , Hugh Dickins , Johannes Weiner , Rik van Riel , KOSAKI Motohiro , Mel Gorman , Jason Evans , Zhang Yanfei Subject: Re: [PATCH v10 1/7] mm: support madvise(MADV_FREE) Message-ID: <20140708003617.GB6076@bbox> References: <1404694438-10272-1-git-send-email-minchan@kernel.org> <1404694438-10272-2-git-send-email-minchan@kernel.org> <20140707104137.GA23150@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20140707104137.GA23150@node.dhcp.inet.fi> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 07, 2014 at 01:41:37PM +0300, Kirill A. Shutemov wrote: > On Mon, Jul 07, 2014 at 09:53:52AM +0900, Minchan Kim wrote: > > Linux doesn't have an ability to free pages lazy while other OS > > already have been supported that named by madvise(MADV_FREE). > > > > The gain is clear that kernel can discard freed pages rather than > > swapping out or OOM if memory pressure happens. > > > > Without memory pressure, freed pages would be reused by userspace > > without another additional overhead(ex, page fault + allocation > > + zeroing). > > > > How to work is following as. > > > > When madvise syscall is called, VM clears dirty bit of ptes of > > the range. If memory pressure happens, VM checks dirty bit of > > page table and if it found still "clean", it means it's a > > "lazyfree pages" so VM could discard the page instead of swapping out. > > Once there was store operation for the page before VM peek a page > > to reclaim, dirty bit is set so VM can swap out the page instead of > > discarding. > > > > Firstly, heavy users would be general allocators(ex, jemalloc, > > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already > > have supported the feature for other OS(ex, FreeBSD) > > > > barrios@blaptop:~/benchmark/ebizzy$ lscpu > > Architecture: x86_64 > > CPU op-mode(s): 32-bit, 64-bit > > Byte Order: Little Endian > > CPU(s): 4 > > On-line CPU(s) list: 0-3 > > Thread(s) per core: 2 > > Core(s) per socket: 2 > > Socket(s): 1 > > NUMA node(s): 1 > > Vendor ID: GenuineIntel > > CPU family: 6 > > Model: 42 > > Stepping: 7 > > CPU MHz: 2801.000 > > BogoMIPS: 5581.64 > > Virtualization: VT-x > > L1d cache: 32K > > L1i cache: 32K > > L2 cache: 256K > > L3 cache: 4096K > > NUMA node0 CPU(s): 0-3 > > > > ebizzy benchmark(./ebizzy -S 10 -n 512) > > > > vanilla-jemalloc MADV_free-jemalloc > > > > 1 thread > > records: 10 records: 10 > > avg: 7682.10 avg: 15306.10 > > std: 62.35(0.81%) std: 347.99(2.27%) > > max: 7770.00 max: 15622.00 > > min: 7598.00 min: 14772.00 > > > > 2 thread > > records: 10 records: 10 > > avg: 12747.50 avg: 24171.00 > > std: 792.06(6.21%) std: 895.18(3.70%) > > max: 13337.00 max: 26023.00 > > min: 10535.00 min: 23152.00 > > > > 4 thread > > records: 10 records: 10 > > avg: 16474.60 avg: 33717.90 > > std: 1496.45(9.08%) std: 2008.97(5.96%) > > max: 17877.00 max: 35958.00 > > min: 12224.00 min: 29565.00 > > > > 8 thread > > records: 10 records: 10 > > avg: 16778.50 avg: 33308.10 > > std: 825.53(4.92%) std: 1668.30(5.01%) > > max: 17543.00 max: 36010.00 > > min: 14576.00 min: 29577.00 > > > > 16 thread > > records: 10 records: 10 > > avg: 20614.40 avg: 35516.30 > > std: 602.95(2.92%) std: 1283.65(3.61%) > > max: 21753.00 max: 37178.00 > > min: 19605.00 min: 33217.00 > > > > 32 thread > > records: 10 records: 10 > > avg: 22771.70 avg: 36018.50 > > std: 598.94(2.63%) std: 1046.76(2.91%) > > max: 24035.00 max: 37266.00 > > min: 22108.00 min: 34149.00 > > > > In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED. > > > > Cc: Michael Kerrisk > > Cc: Linux API > > Cc: Hugh Dickins > > Cc: Johannes Weiner > > Cc: KOSAKI Motohiro > > Cc: Mel Gorman > > Cc: Jason Evans > > Cc: Zhang Yanfei > > Acked-by: Rik van Riel > > Signed-off-by: Minchan Kim > > --- > > ... > > > +static void madvise_free_page_range(struct mmu_gather *tlb, > > + struct vm_area_struct *vma, > > + unsigned long addr, unsigned long end) > > +{ > > + pgd_t *pgd; > > + unsigned long next; > > + > > + BUG_ON(addr >= end); > > + tlb_start_vma(tlb, vma); > > + pgd = pgd_offset(vma->vm_mm, addr); > > + do { > > + next = pgd_addr_end(addr, end); > > + if (pgd_none_or_clear_bad(pgd)) > > + continue; > > + next = madvise_free_pud_range(tlb, vma, pgd, addr, next); > > + } while (pgd++, addr = next, addr != end); > > + tlb_end_vma(tlb, vma); > > Any particular reason why pagewalker can't be used here? Nothing special. I just copied from MADV_DONTNEED. Will try it. > > > +} > > ... > > > @@ -381,6 +547,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > > return madvise_remove(vma, prev, start, end); > > case MADV_WILLNEED: > > return madvise_willneed(vma, prev, start, end); > > + case MADV_FREE: > > + /* > > + * XXX: In this implementation, MADV_FREE works like > > + * MADV_DONTNEED on swapless system or full swap. > > + */ > > + if (get_nr_swap_pages() > 0) > > + return madvise_free(vma, prev, start, end); > > Looks racy wrt to full swap. What will happen if we will do madvise_free() > on full swap? Now, we don't age anonymous LRU list if swap is full so that VM would lose the chance to discard freed page via shrink_page_list in this implementation if that race hppanes. But it would be not severe because MADV_FREE semantic doesn't say VM must discard them but it is just hint from userside that specified range is no longer important so that VM can gave a freedom to free and I think it's not a common case. In addition, I have a plan to support MADV_FREE on swapless system, too. > > > case MADV_DONTNEED: > > return madvise_dontneed(vma, prev, start, end); > > default: > > ... > > > @@ -1204,6 +1223,16 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > } > > dec_mm_counter(mm, MM_ANONPAGES); > > inc_mm_counter(mm, MM_SWAPENTS); > > + } else if (flags & TTU_UNMAP) { > > + if (dirty || PageDirty(page)) { > > + set_pte_at(mm, address, pte, pteval); > > + ret = SWAP_FAIL; > > + goto out_unmap; > > I don't get this part. > Looks like it will fail to unmap the page if it's dirty and not backed by > swapcache. Current code doesn't have such limitation. > Do we really need this? Good point. Code is rather ugly even, it has side-effect with hwpoisend page unmapping. How about this? I didn't test it but if there is no objection, I will go this with stress testing. --- include/linux/rmap.h | 1 + mm/rmap.c | 22 ++++++++++++---------- mm/vmscan.c | 5 +++-- 3 files changed, 16 insertions(+), 12 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index dea05914f167..0ba377b97a38 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -75,6 +75,7 @@ enum ttu_flags { TTU_UNMAP = 1, /* unmap mode */ TTU_MIGRATION = 2, /* migration mode */ TTU_MUNLOCK = 4, /* munlock mode */ + TTU_FREE = 8, /* free mode */ TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ diff --git a/mm/rmap.c b/mm/rmap.c index 3c415eb8b6f0..010d51ea26c4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1209,6 +1209,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_entry_t entry = { .val = page_private(page) }; pte_t swp_pte; + if (flags & TTU_FREE) { + if (dirty || PageDirty(page)) { + set_pte_at(mm, address, pte, pteval); + ret = SWAP_FAIL; + goto out_unmap; + } else { + /* It's a freeable page by MADV_FREE */ + dec_mm_counter(mm, MM_ANONPAGES); + goto discard; + } + } + if (PageSwapCache(page)) { /* * Store the swap location in the pte. @@ -1227,16 +1239,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } dec_mm_counter(mm, MM_ANONPAGES); inc_mm_counter(mm, MM_SWAPENTS); - } else if (flags & TTU_UNMAP) { - if (dirty || PageDirty(page)) { - set_pte_at(mm, address, pte, pteval); - ret = SWAP_FAIL; - goto out_unmap; - } else { - /* It's a freeable page by madvise_free */ - dec_mm_counter(mm, MM_ANONPAGES); - goto discard; - } } else if (IS_ENABLED(CONFIG_MIGRATION)) { /* * Store the pfn of the page in a special migration diff --git a/mm/vmscan.c b/mm/vmscan.c index 4e15babf4414..a7dbce703208 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1549,8 +1549,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, if (nr_taken == 0) return 0; - nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_dirty, &nr_unqueued_dirty, &nr_congested, + nr_reclaimed = shrink_page_list(&page_list, zone, sc, + TTU_UNMAP|TTU_FREE, &nr_dirty, + &nr_unqueued_dirty, &nr_congested, &nr_writeback, &nr_immediate, false); -- 2.0.0 > > > + } else { > > + /* It's a freeable page by madvise_free */ > > + dec_mm_counter(mm, MM_ANONPAGES); > > + goto discard; > > + } > > } else if (IS_ENABLED(CONFIG_MIGRATION)) { > > /* > > * Store the pfn of the page in a special migration > > -- > Kirill A. Shutemov > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/