Date: Thu, 5 Nov 2015 10:37:41 +0900
From: Minchan Kim
To: Shaohua Li
CC: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Kerrisk, linux-api@vger.kernel.org, Hugh Dickins,
    Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
    Jason Evans, Daniel Micay, "Kirill A. Shutemov", Michal Hocko,
    yalin.wang2010@gmail.com, bmaurer@fb.com, John Stultz
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
Message-ID: <20151105013741.GI7357@bbox>
In-Reply-To: <20151105013350.GH7357@bbox>
References: <1446600367-7976-1-git-send-email-minchan@kernel.org>
    <1446600367-7976-2-git-send-email-minchan@kernel.org>
    <20151104200006.GA46783@kernel.org>
    <20151105013350.GH7357@bbox>

On Thu, Nov 05, 2015 at 10:33:50AM +0900, Minchan Kim wrote:
> On Wed, Nov 04, 2015 at 12:00:06PM -0800, Shaohua Li wrote:
> > On Wed, Nov 04, 2015 at 10:25:55AM +0900,
Minchan Kim wrote:
> > > Linux doesn't have an ability to free pages lazy while other OS already
> > > have been supported that named by madvise(MADV_FREE).
> > >
> > > The gain is clear that kernel can discard freed pages rather than swapping
> > > out or OOM if memory pressure happens.
> > >
> > > Without memory pressure, freed pages would be reused by userspace without
> > > another additional overhead(ex, page fault + allocation + zeroing).
> > >
> > > Jason Evans said:
> > >
> > > : Facebook has been using MAP_UNINITIALIZED
> > > : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
> > > : several years, but there are operational costs to maintaining this
> > > : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
> > > : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
> > > : increased throughput for much of our workload by ~5%, and although the
> > > : benefit has decreased using newer hardware and kernels, there is still
> > > : enough benefit that we cannot reasonably retire it without a replacement.
> > > :
> > > : Aside from Facebook operations, there are numerous broadly used
> > > : applications that would benefit from MADV_FREE. The ones that immediately
> > > : come to mind are redis, varnish, and MariaDB. I don't have much insight
> > > : into Android internals and development process, but I would hope to see
> > > : MADV_FREE support eventually end up there as well to benefit applications
> > > : linked with the integrated jemalloc.
> > > :
> > > : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
> > > : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
> > > : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
> > > : (and AIX, but I'm not sure it even compiles on AIX).
The lack of
> > > : MADV_FREE on Linux forced me down a long series of increasingly
> > > : sophisticated heuristics for madvise() volume reduction, and even so this
> > > : remains a common performance issue for people using jemalloc on Linux.
> > > : Please integrate MADV_FREE; many people will benefit substantially.
> > >
> > > How it works:
> > >
> > > When madvise syscall is called, VM clears dirty bit of ptes of the range.
> > > If memory pressure happens, VM checks dirty bit of page table and if it
> > > found still "clean", it means it's a "lazyfree pages" so VM could discard
> > > the page instead of swapping out. Once there was store operation for the
> > > page before VM peek a page to reclaim, dirty bit is set so VM can swap out
> > > the page instead of discarding.
> > >
> > > Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
> > > and hope glibc supports it) and jemalloc/tcmalloc already have supported
> > > the feature for other OS(ex, FreeBSD)
> > >
> > > barrios@blaptop:~/benchmark/ebizzy$ lscpu
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                12
> > > On-line CPU(s) list:   0-11
> > > Thread(s) per core:    1
> > > Core(s) per socket:    1
> > > Socket(s):             12
> > > NUMA node(s):          1
> > > Vendor ID:             GenuineIntel
> > > CPU family:            6
> > > Model:                 2
> > > Stepping:              3
> > > CPU MHz:               3200.185
> > > BogoMIPS:              6400.53
> > > Virtualization:        VT-x
> > > Hypervisor vendor:     KVM
> > > Virtualization type:   full
> > > L1d cache:             32K
> > > L1i cache:             32K
> > > L2 cache:              4096K
> > > NUMA node0 CPU(s):     0-11
> > > ebizzy benchmark(./ebizzy -S 10 -n 512)
> > >
> > > Higher avg is better.
> > >
> > >  vanilla-jemalloc             MADV_free-jemalloc
> > >
> > > 1 thread
> > > records: 10                   records: 10
> > > avg: 2961.90                  avg: 12069.70
> > > std: 71.96(2.43%)             std: 186.68(1.55%)
> > > max: 3070.00                  max: 12385.00
> > > min: 2796.00                  min: 11746.00
> > >
> > > 2 thread
> > > records: 10                   records: 10
> > > avg: 5020.00                  avg: 17827.00
> > > std: 264.87(5.28%)            std: 358.52(2.01%)
> > > max: 5244.00                  max: 18760.00
> > > min: 4251.00                  min: 17382.00
> > >
> > > 4 thread
> > > records: 10                   records: 10
> > > avg: 8988.80                  avg: 27930.80
> > > std: 1175.33(13.08%)          std: 3317.33(11.88%)
> > > max: 9508.00                  max: 30879.00
> > > min: 5477.00                  min: 21024.00
> > >
> > > 8 thread
> > > records: 10                   records: 10
> > > avg: 13036.50                 avg: 33739.40
> > > std: 170.67(1.31%)            std: 5146.22(15.25%)
> > > max: 13371.00                 max: 40572.00
> > > min: 12785.00                 min: 24088.00
> > >
> > > 16 thread
> > > records: 10                   records: 10
> > > avg: 11092.40                 avg: 31424.20
> > > std: 710.60(6.41%)            std: 3763.89(11.98%)
> > > max: 12446.00                 max: 36635.00
> > > min: 9949.00                  min: 25669.00
> > >
> > > 32 thread
> > > records: 10                   records: 10
> > > avg: 11067.00                 avg: 34495.80
> > > std: 971.06(8.77%)            std: 2721.36(7.89%)
> > > max: 12010.00                 max: 38598.00
> > > min: 9002.00                  min: 30636.00
> > >
> > > In summary, MADV_FREE is about much faster than MADV_DONTNEED.
> >
> > The MADV_FREE is discussed for a while, it probably is too late to propose
> > something new, but we had the new idea (from Ben Maurer, CCed) recently and
> > think it's better. Our target is still jemalloc.
> >
> > Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win to reduce
> > page fault. But there is one issue remaining, the TLB flush. Both MADV_DONTNEED
> > and MADV_FREE do TLB flush. TLB flush overhead is quite big in contemporary
> > multi-thread applications. In our production workload, we observed 80% CPU
> > spending on TLB flush triggered by jemalloc madvise(MADV_DONTNEED) sometimes.
> > We haven't tested MADV_FREE yet, but the result should be similar. It's hard to
> > avoid the TLB flush issue with MADV_FREE, because it helps avoid data
> > corruption.
> >
> > The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
> >
> > MARK_FREE. Userspace notifies kernel the memory range can be discarded. Kernel
> > just records the range in current stage. Should memory pressure happen, page
> > reclaim can free the memory directly regardless the pte state.
> >
> > MARK_NOFREE. Userspace notifies kernel the memory range will be reused soon.
> > Kernel deletes the record and prevents page reclaim discards the memory. If the
> > memory isn't reclaimed, userspace will access the old memory, otherwise do
> > normal page fault handling.
> >
> > The point is to let userspace notify kernel if memory can be discarded, instead
> > of depending on pte dirty bit used by MADV_FREE. With these, no TLB flush is
> > required till page reclaim actually frees the memory (page reclaim need do the
> > TLB flush for MADV_FREE too). It still preserves the lazy memory free merit of
> > MADV_FREE.
> >
> > Compared to MADV_FREE, reusing memory with the new proposal isn't transparent,
> > eg must call MARK_NOFREE. But it's easy to utilize the new API in jemalloc.
> >
> > We don't have code to backup this yet, sorry. We'd like to discuss it if it
> > makes sense.
>
> It's really what volatile range did.
> John Stultz and me tried it for a *long* time but it had lots of troubles.
> It's really hard to write it down in my time due to really long history
> and even I forgot lots of detail(ie, dead brain).
> Please search volatile ranges in google.
> Finally, people in LSF/MM suggested MADV_FREE to help anonymous page side
> rather than stucking which prevent useful feature. :(

I should have Cc'ed John Stultz.
He has a better memory than I do, so he could help, but I'm not sure
he is still interested in volatile ranges.
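
For anyone who wants to try the semantic from userspace, here is a minimal
sketch of how an allocator-like caller might issue the hint. It is not part
of the patch. The helper names release_lazily() and lazy_free_demo() are
made up for illustration, and falling back to MADV_DONTNEED when the kernel
returns EINVAL is my assumption about how a caller would want to behave on a
kernel without MADV_FREE. The value 8 is the one used for MADV_FREE in the
Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Older libc headers may not define MADV_FREE; Linux uses the value 8. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

/*
 * Hypothetical helper: mark [addr, addr + len) as lazily freeable.
 * Returns 1 if MADV_FREE was accepted, 0 if the kernel rejected it with
 * EINVAL and MADV_DONTNEED was used instead, -1 on any other failure.
 */
int release_lazily(void *addr, size_t len)
{
    assert(len > 0);
    if (madvise(addr, len, MADV_FREE) == 0)
        return 1;
    if (errno != EINVAL)
        return -1;
    return madvise(addr, len, MADV_DONTNEED) == 0 ? 0 : -1;
}

/* Hypothetical demo: map, dirty, release, then reuse an anonymous region. */
int lazy_free_demo(size_t len)
{
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);             /* dirty every page */

    int rc = release_lazily(buf, len);  /* contents are now undefined */
    if (rc < 0) {
        munmap(buf, len);
        return -1;
    }

    /*
     * Reuse without any further syscall. The store sets the dirty bit
     * again, so reclaim must no longer discard these pages.
     */
    memset(buf, 0x5A, len);
    int ok = (buf[0] == 0x5A && buf[len - 1] == 0x5A);

    munmap(buf, len);
    return ok ? rc : -1;
}
```

Calling lazy_free_demo(1 << 20) returns 1 where MADV_FREE is supported and 0
where the fallback was taken. Note that between the madvise() call and the
next store the caller must not rely on the old contents: a read may see
either the old data or zero-filled pages.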