Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753254AbZFHRS5 (ORCPT ); Mon, 8 Jun 2009 13:18:57 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751391AbZFHRSu (ORCPT ); Mon, 8 Jun 2009 13:18:50 -0400 Received: from mx2.redhat.com ([66.187.237.31]:39055 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750784AbZFHRSt (ORCPT ); Mon, 8 Jun 2009 13:18:49 -0400 Message-ID: <4A2D47C1.5020302@redhat.com> Date: Mon, 08 Jun 2009 20:17:53 +0300 From: Izik Eidus User-Agent: Thunderbird 2.0.0.21 (Windows/20090302) MIME-Version: 1.0 To: Hugh Dickins CC: aarcange@redhat.com, akpm@linux-foundation.org, nickpiggin@yahoo.com.au, chrisw@redhat.com, riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise References: <1242261048-4487-1-git-send-email-ieidus@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 10389 Lines: 246 Hugh Dickins wrote: > On Thu, 14 May 2009, Izik Eidus wrote: > > >> This is comment request for ksm api changes. >> The following patchs move the api to use madvise instead of ioctls. >> > > Thanks a lot for doing this. > > I'm afraid more than three weeks have gone past, and the 2.6.31 > merge window is almost upon us, and you haven't even got a comment > out of me: I apologize for that. > > Although my (lack of) response is indistinguishable from a conspiracy > to keep KSM out of the kernel, I beg to assure you that's not the case. > I do want KSM to go in - though I never shared Andrew's optimism that it > is 2.6.31 material: I've too long a list of notes/doubts on the existing > implementation, which I've not had time to expand upon to you; but I > don't think there are any killer issues, we should be able to work > things out as 2.6.31 goes through its -rcs, and aim for 2.6.32. > > But let's get this change of interface sorted out first. > Agree. > I remain convinced that it's right to go the madvise() route, > though I don't necessarily like the details in your patches. > And I've come to the conclusion that the only way I can force > myself to contribute constructively, is to start from these > patches, and shift things around until it's as I think it > should be, then see what you think of the result. > Sound perfect way to go. > I notice that you chose to integrate fully (though not fully enough) > with vmas, adding a VM_MERGEABLE flag. Fine, that's probably going > to be safest in the end, and I'll follow you; but it is further than > I was necessarily asking you to go - it might have been okay to use > the madvise() interface, but just to declare areas of address space > (not necessarily backed by mappings) to ksm.c, as you did via /dev/ksm. > But it's fairly likely that if you had stayed with that, it would have > proved problematic later, so let's go forward with the full integration > with vmas. > > >> Before i will describe the patchs, i want to note that i rewrote this >> patch seires alot of times, all the other methods that i have tried had some >> fandumatel issues with them. >> The current implemantion does have some issues with it, but i belive they are >> all solveable and better than the other ways to do it. >> If you feel you have better way how to do it, please tell me :). >> >> Ok when we changed ksm to use madvise instead of ioctls we wanted to keep >> the following rules: >> >> Not to increase the host memory usage if ksm is not being used (even when it >> is compiled), this mean not to add fields into mm_struct / vm_area_struct... >> >> Not to effect the system performence with notifiers that will have to block >> while ksm code is running under some lock - ksm is helper, it should do it >> work quitely, - this why i dropped patch that i did that add mmu notifiers >> support inside ksm.c and recived notifications from the MM (for example >> when vma is destroyed (invalidate_range...) >> >> Not to change the MM logic. >> >> Trying to touch as less code as we can outisde ksm.c >> > > These are well-intentioned goals, and thank you for making the effort > to follow them; but I'm probably going to depart from them. I'd > rather we put in what's necessary and appropriate, and then cut > that down if necessary. > That the way to go, i just didnt want to scare anyone (it was obiouse to me that it is needed, just wanted you to say it is needed) > >> Taking into account all this rules, the end result that we have came with is: >> mmlist is now not used only by swapoff, but by ksm as well, this mean that >> each time you call to madvise for to set vma as MERGEABLE, madvise will check >> if the mm_struct is inside the mmlist and will insert it in case it isnt. >> It is belived that it is better to hurt little bit the performence of swapoff >> than adding another list into the mm_struct. >> > > That was a perfectly sensible thing for you to do, given your rules > above; but I don't really like the result, and think it'll be clearer > to have your own list. Whether by mm or by vma, I've not yet decided: > by mm won't add enough #idef CONFIG_KSM bloat to worry about; by vma, > we might be able to reuse some prio_tree fields, I've not checked yet. > > >> One issue that should be note is: after mm_struct is going into the mmlist, it >> wont be kicked from it until the procsses is die (even if there are no more >> VM_MERGEABLE vmas), this doesnt mean memory is wasted, but it does mean ksm >> will spend little more time in doing cur = cur->next if(...). >> >> Another issue is: when procsess is die, ksm will have to find (when scanning) >> that its mm_users == 1 and then do mmput(), this mean that there might be dealy >> from the time that someone do kill until the mm is really free - >> i am open for suggestions on how to improve this... >> > > You've resisted putting in the callbacks you need. I think they were > always (i.e. even when using /dev/ksm) necessary, but should become > more obvious now we have this tighter integration with mm's vmas. > > You seem to have no callback in fork: doesn't that mean that KSM > pages get into mms of which mm/ksm.c has no knowledge? What you mean by this?, should the vma flags be copyed into the child and therefore ksm will scan the vma? (only thing i have to check is: maybe the process itself wont go into the mmlist, and therefore ksm wont know about it) > You had > no callback in mremap move: doesn't that mean that KSM pages could > be moved into areas which mm/ksm.c never tracked? Though that's > probably no issue now we move over to vmas: they should now travel > with their VM flag. You have no callback in unmap: doesn't that > mean that KSM never knows when its pages have gone away? > Yes, Adding all this callbacks would make ksm much more happy, Again, i didnt want to scare anyone... > (Closing the /dev/ksm fd used to clean up some of this, in the > end; but the lifetime of the fd can be so different from that of > the mapped area, I've felt very unsafe with that technique - a good > technique when you're trying to sneak in special handling for your > special driver, but not a good technique once you go to mainline.) > > I haven't worked out the full consequences of these lost pages: > perhaps it's no worse than that you could never properly enforce > your ksm_thread_max_kernel_pages quota. > You mean the shared pages outside the stable tree comment? > >> (when someone do echo 0 > /sys/kernel/mm/ksm/run ksm will throw away all the >> memory, so condtion when the memory wont ever be free wont happen) >> >> >> Another important thing is: this is request for comment, i still not sure few >> things that we have made here are totaly safe: >> (the mmlist sync with drain_mmlist, and the handle_vmas() function in madvise, >> the logic inside ksm for searching the next virtual address on the vmas, >> and so on...) >> The main purpuse of this is to ask if the new interface is what you guys >> want..., and if you like the impelmantion desgin. >> > > It's in the right direction. But it would be silly for me to start > criticizing your details now: I need to try doing the same, that will > force me to think deeply enough about it, and I may then be led to > the same decisions as you made. > > >> (I have added option to scan closed support applications as well) >> > > That's a nice detail that I'll find very useful for testing, > but we might want to hold it back longer than the rest. I just get > naturally more cautious when we consider interfaces for doing things > to other processes, and want to spend even longer over it. > > >> Thanks. >> >> Izik Eidus (4): >> madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls. >> > > I didn't understand why you went over to VM_MERGEABLE but stuck > with MADV_SHAREABLE: there's a confusing mix of shareables and > mergeables, I'll head for mergeables throughout, though keep to "KSM". > > >> mmlist: share mmlist with ksm. >> ksm: change ksm api to use madvise instead of ioctls. >> ksm: add support for scanning procsses that were not modifided to use >> ksm >> > > While I'm being communicative, let me mention two things, > not related to this RFC patchset, but to what's currently in mmotm. > > I've a bugfix patch to scan_get_next_index(), I'll send that to you > Thanks. > in a few moments. > > And a question on your page_wrprotect() addition to mm/rmap.c: though > it may contain some important checks (I'm thinking of the get_user_pages > protection), isn't it essentially redundant, and should be removed from > the patchset? If we have a private page which is mapped into more than > the one address space by which we arrive at it, then, quite independent > of KSM, it needs to be write-protected already to prevent mods in one > address space leaking into another - doesn't it? So I see no need for > the rmap'ped write-protection there, just make the checks and write > protect the pte you have in ksm.c. Or am I missing something? > Ok, so we have here 2 cases for ksm: 1: When the page is anonymous and is mapped readonly beteween serveal processes: for this you say we shouldnt walk over the rmap and try to writeprotect what is already writeprtected... 2: When the page is anonymous and is mapped write by just one process: for this you say it is better to handle it directly from inside ksm beacuse we already know the virtual address mapping of this page? so about this: you are right about the fact that we might dont have to walk over the rmap of the page for pages with mapcount 1 but isnt it cleaner to deal it inside rmap.c? another thing, get_user_pages() protection is needed even in that case, beacuse get_user_pages_fast is lockless, so odirect can run under our legs after we write protecting the page. anyway, nothing critical, i dont mind to move page_write_protect_one() into ksm.c, i still think get_user_pages protection is needed. Thanks alot for your time. > Hugh > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/