Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760833Ab2KAAPX (ORCPT ); Wed, 31 Oct 2012 20:15:23 -0400 Received: from LGEMRELSE7Q.lge.com ([156.147.1.151]:53077 "EHLO LGEMRELSE7Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754236Ab2KAAPV (ORCPT ); Wed, 31 Oct 2012 20:15:21 -0400 X-AuditID: 9c930197-b7c0bae00000042f-f6-5091bf1682a9 Date: Thu, 1 Nov 2012 09:21:18 +0900 From: Minchan Kim To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, John Stultz , Christoph Lameter , Android Kernel Team , Robert Love , Mel Gorman , Hugh Dickins , Dave Hansen , Rik van Riel , Dave Chinner , Neil Brown , Mike Hommey , Taras Glek , KOSAKI Motohiro , KAMEZAWA Hiroyuki , sanjay@google.com, pjt@google.com, David Rientjes Subject: Re: [RFC v2] Support volatile range for anon vma Message-ID: <20121101002118.GA26256@bbox> References: <1351560594-18366-1-git-send-email-minchan@kernel.org> <20121031143524.0509665d.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121031143524.0509665d.akpm@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6005 Lines: 163 Hi Andrew, On Wed, Oct 31, 2012 at 02:35:24PM -0700, Andrew Morton wrote: > On Tue, 30 Oct 2012 10:29:54 +0900 > Minchan Kim wrote: > > > This patch introudces new madvise behavior MADV_VOLATILE and > > MADV_NOVOLATILE for anonymous pages. It's different with > > John Stultz's version which considers only tmpfs while this patch > > considers only anonymous pages so this cannot cover John's one. > > If below idea is proved as reasonable, I hope we can unify both > > concepts by madvise/fadvise. > > > > Rationale is following as. > > Many allocators call munmap(2) when user call free(3) if ptr is > > in mmaped area. But munmap isn't cheap because it have to clean up > > all pte entries and unlinking a vma so overhead would be increased > > linearly by mmaped area's size. > > Presumably the userspace allocator will internally manage memory in > large chunks, so the munmap() call frequency will be much lower than > the free() call frequency. So the performance gains from this change > might be very small. > > The whole point of the patch is to improve performance, but we have no > evidence that it was successful in doing that! I do think we'll need > good quantitative testing results before proceeding with such a patch, > please. Absolutely. That's why I send it as RFC. In this time, I would like to reach a concensus on that this idea makes sense before further investigating because we have lots of experienced developer pool and one of them might know this is really needed or not. > > Also, it is very desirable that we involve the relevant userspace > (glibc, etc) developers in this. And I understand that the google > tcmalloc project will probably have interest in this - I've cc'ed > various people@google in the hope that they can provide input (please). Thanks! I should have done. Such input is really one I need now. > > Also, it is a userspace API change. Please cc mtk.manpages@gmail.com. This is RFC so we don't have anything fixed until now. I will Cc'ed him after everything I should solve goes out and interface is fixed. > > Also, I assume that you have userspace test code. At some stage, > please consider adding a case to tools/testing/selftests. Such a test > would require to creation of memory pressure, which is rather contrary > to the selftests' current philosopy of being a bunch of short-running > little tests. Perhaps you can come up with something. But I suggest > that such work be done later, once it becomes clearer that this code is > actually headed into the kernel. Yes. > > > Allocator should call madvise(MADV_NOVOLATILE) before reusing for > > allocating that area to user. Otherwise, accessing of volatile range > > will meet SIGBUS error. > > Well, why? It would be easy enough for the fault handler to give > userspace a new, zeroed page at that address. Absolutely. It would be convenient but as a matter of fact, I am considering to unify John Stultz's fallocate volatile range which consider only tmpfs pages so madvise/fadvise might be better candidate as API. In tmpfs case, John implemented it as returning zero page when someone access volatile region like you mentioned but in this kernel summit, Hugh pointed out and wanted to return SIGBUS and I think it makes debug better. Another option is we can put a flag in API which indicates that VM will return zero page or SIGBUS when user access volatile range so user can do what they want. > > Or we could simply leave the old page in place at that address. If the > page gets touched, we clear MADV_NOVOLATILE on its VMA and give the > page (or all the not-yet-reclaimed pages) back to userspace at their > old addresses. > > Various options suggest themselves here. You've chosen one of them but > I would like to see a pretty exhaustive description of the reasoning > behind that decision. Will do. > > Also, I wonder about the interaction with other vma manipulation > operations. For example, can a VMA get split when in the MADV_VOLATILE > state? If so, what happens? Both VMAs would be volatile although one of either has never reclaimed pages. I understand it's not an optimal but I expect user will not do such operations(ex, mprotect, mremap) frequently on volatile vma. If they do, maybe I need to come up with something but It wouldn't be easy. > > Also, I see no reason why the code shouldn't work OK with nonlinear VMAs, > but I bet this wasn't tested ;) Yes. I didn't consider that yet. AFAIK, nonlinear vma is related to file-mapped vma while this patch consider only anon vma which is good for first step. > > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma, > > if (error) > > goto out; > > break; > > + case MADV_VOLATILE: > > + if (vma->vm_flags & VM_LOCKED) { > > + error = -EINVAL; > > + goto out; > > + } > > + new_flags |= VM_VOLATILE; > > + vma->purged = false; > > + break; > > + case MADV_NOVOLATILE: > > + if (!(vma->vm_flags & VM_VOLATILE)) { > > + error = -EINVAL; > > + goto out; > > I wonder if this really should return an error. Other madvise() > options don't do this, and running MADV_NOVOLATILE against a > not-volatile area seems pretty benign and has clearly defined before- > and after- states. Will fix. Thanks for the review, Andrew. > > > + } > > + > > + new_flags &= ~VM_VOLATILE; > > + break; > > } > > > > if (new_flags == vma->vm_flags) { > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/