Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756533AbZFHW6S (ORCPT ); Mon, 8 Jun 2009 18:58:18 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755490AbZFHW6K (ORCPT ); Mon, 8 Jun 2009 18:58:10 -0400 Received: from mx2.redhat.com ([66.187.237.31]:57569 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753696AbZFHW6J (ORCPT ); Mon, 8 Jun 2009 18:58:09 -0400 Date: Tue, 9 Jun 2009 00:57:57 +0200 From: Andrea Arcangeli To: Hugh Dickins Cc: Izik Eidus , akpm@linux-foundation.org, nickpiggin@yahoo.com.au, chrisw@redhat.com, riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/4] RFC - ksm api change into madvise Message-ID: <20090608225756.GB8642@random.random> References: <1242261048-4487-1-git-send-email-ieidus@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5938 Lines: 107 Hi Hugh! On Mon, Jun 08, 2009 at 05:18:57PM +0100, Hugh Dickins wrote: > I notice that you chose to integrate fully (though not fully enough) > with vmas, adding a VM_MERGEABLE flag. Fine, that's probably going > to be safest in the end, and I'll follow you; but it is further than After we decided to switch to madvise, I suggested to take the approach in the RFC in this thread so with a rmap_item list ordered by address that is scanned nested in the core-vma loop that searches for vma with VM_MERGEABLE set, so that madvise has only to set the VM_MERGEABLE bit, and it still avoids to alter the VM at all. Replacing the ksm slots (kind of out of sync ksm vmas) with the nested loop, allows to provide madvise semantics while still being out of sync with rmap_item and tree_item. > You've resisted putting in the callbacks you need. I think they were > always (i.e. even when using /dev/ksm) necessary, but should become > more obvious now we have this tighter integration with mm's vmas. There is no need of littering the VM with new callbacks, KSM only has to register in the mmu notifier to teardown rmap_items (and tree_item when the last rmap_item of the tree_item list is gone) synchronously when the mappings are invalidated during munmap/mremap or swapping (anon pages can already be swapped while it's tracked by the unstable tree, and later we hope stable tree ksm pages can be swapped too, and we get rid of those from the rbtree as we walk it with is_present_pte and gup). After that we can also have tree_item store the kernel-vaddr (or struct page *) to avoid calling gup() during the walk of the rbtree. However such a change involves huge complexities and little gain, and the part of getting rid of orphaned rmap_item/tree_item synchronously doesn't provide practical advantages. It takes a single pass to get rid of all orphaned rmap_items/tree_items, and shutting down ksm gets rid of everything releasing any memory allocated by KSM (or at least it should, especially if we don't allow it to be a module anymore because it might not be worth it, the =N option is still useful for the embedded that are sure they won't get great benefit from KSM in their apps). Izik has been brave enough to try having rmap_item/tree_item in sync with mmu notifier already, but one of the complexities were in the locking, we can't schedule anywhere in the mmu notifier methods (this could be relaxed but it makes no sense to depend on this just to get rid of orphaned rmap_item/tree_item synchronously). Other complexities were in the fact replace_page when it does set_pte_at_notify will recurse in KSM itself if KSM registers in MMU notifier. Admittedly I didn't spend much time thinking about how everything is supposed to work with mmu notifer and in-sync rmap_item/tree_item with tree_item pointing to a physical page address without any struct page pinning whatsoever (the very cool thing that is all about mmu notifier, no pinning and full swapping) yet, but because I doubt it will provide any significant advantage and I think being simpler has some value too at least until there are more users of the feature. Hence the current implementation still has an out of sync scan and garbage collect rmap_item/tree_item lazily. This is not KVM that without mmu notifier can't swap at all. We can do all our work here in KSM out of sync by just garbage collecting the orphaned entries out of sync, KSM works with virtual addresses it only does temporary gup pins so it's not forced to use mmu notifier to give good behavior. So my suggestion is to go with out of sync, and then once this is all solid and finished, we can try to use mmu notifier to make it in sync and benchmark to see if it is worth it. With the primary benefit of being able to remove gup from the tree lookup (collecting orphaned rmap_item/tree_item a bit sooner doesn't matter so much IMHO, at least for KVM). > And a question on your page_wrprotect() addition to mm/rmap.c: though > it may contain some important checks (I'm thinking of the get_user_pages > protection), isn't it essentially redundant, and should be removed from > the patchset? If we have a private page which is mapped into more than > the one address space by which we arrive at it, then, quite independent > of KSM, it needs to be write-protected already to prevent mods in one > address space leaking into another - doesn't it? So I see no need for > the rmap'ped write-protection there, just make the checks and write > protect the pte you have in ksm.c. Or am I missing something? Makes sense, basically if pte is already wrprotected we've to do nothing, and if it's writable we only need to care about the current pte because mapcount is guaranteed 1. That probably can provide a minor optimization to the ksm performance. OTOH if we'll ever decide to merge more than anon pages this won't hold. Let's say anon pages are by orders of magnitude simpler to merge than anything else because they don't belong to all other data structures like pagecache, I guess we'll never merge pagecache given the significant additional complexities involved, so I'm not against keeping replace_page local and without walking the anon_vma list at all. So let us know what you think about the rmap_item/tree_item out of sync, or in sync with mmu notifier. As said Izik already did a preliminary patch with mmu notifier registration. I doubt we want to invest in that direction unless there's 100% agreement that it is definitely the way to go, and the expectation that it will make a substantial difference to the KSM users. Minor optimizations that increase complexity a lot, can be left for later. Thanks for looking into KSM! Andrea -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/