Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753294AbbGVWg7 (ORCPT ); Wed, 22 Jul 2015 18:36:59 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:34022 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753190AbbGVWgz (ORCPT ); Wed, 22 Jul 2015 18:36:55 -0400 Date: Wed, 22 Jul 2015 15:36:52 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Vlastimil Babka cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Hugh Dickins , Andrea Arcangeli , "Kirill A. Shutemov" , Rik van Riel , Mel Gorman , Joonsoo Kim Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd In-Reply-To: <55AFB569.90702@suse.cz> Message-ID: References: <1435826795-13777-1-git-send-email-vbabka@suse.cz> <1435826795-13777-2-git-send-email-vbabka@suse.cz> <55AE0AFE.8070200@suse.cz> <55AFB569.90702@suse.cz> User-Agent: Alpine 2.10 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5381 Lines: 102 On Wed, 22 Jul 2015, Vlastimil Babka wrote: > > I don't think the criteria for waking up khugepaged should become any more > > complex beyond its current state, which is impacted by two different > > tunables, and whether it actually has memory to scan. During this > > additional wakeup, you'd also need to pass kcompactd's node and only do > > local khugepaged scanning since there's no guarantee khugepaged can > > allocate on all nodes when one kcompactd defragments memory. > > Keeping track of the nodes where hugepage allocations are expected to succeed > is already done in this series. "local khugepaged scanning" is unfortunately > not possible in general, since the node that will be used for a given pmd is > not known until half of pte's (or more) are scanned. > When a khugepaged allocation fails for a node, it could easily kick off background compaction on that node and revisit the range later, very similar to how we can kick off background compaction in the page allocator when async or sync_light compaction fails. The distinction I'm trying to draw is between "periodic" and "background" compaction. I think there're usecases for both and we shouldn't be limiting ourselves to one or the other. Periodic compaction would wakeup at a user-defined period and fully compact memory over all nodes, round-robin at each wakeup. This keeps fragmentation low so that (ideally) background compaction or direct compaction wouldn't be needed. Background compaction would be triggered from the page allocator when async or sync_light compaction fails, regardless of whether this was from khugepaged, page fault context, or any other high-order allocation. This is an interesting discussion because I can think of lots of ways to be smart about it, but I haven't tried to implement it yet: heuristics that do ratelimiting, preemptive compaction based on fragmentation stats, etc. My rfc implements periodic compaction in khugepaged simply because we find very large thp_fault_fallback numbers and these faults tend to come in bunches so that background compaction wouldn't really help the situation itself: it's simply not fast enough and we give up compaction at fault way too early for it have a chance of being successful. I have a hard time finding other examples of that outside thp, especially at such large orders. The number one culprit that I can think of would be slub and I haven't seen any complaints about high order_fallback stats. The additional benefit of doing the periodic compaction in khugepaged is that we can do it before scanning, where alloc_sleep_millisecs is so high that kicking off background compaction on allocation failure wouldn't help. Then, storing the nodes where khugepaged allocation has failed isn't needed: the allocation itself would trigger background compaction. > > If it's unmovable fragmentation, then any periodic synchronous memory > > compaction isn't going to help either. > > It can help if it moves away movable pages out of unmovable pageblocks, so the > following unmovable allocations can be served from those pageblocks and not > fallback to pollute another movable pageblock. Even better if this is done > (kcompactd woken up) in response to such fallback, where unmovable page falls > to a partially filled movable pageblock. Stuffing also this into khugepaged > would be really a stretch. Joonsoo proposed another daemon for that in > https://lkml.org/lkml/2015/4/27/94 but extending kcompactd would be a very > natural way for this. > Sure, this is an example of why background compaction would be helpful and triggered by the page allocator when async or migrate_sync allocation fails. > > Hmm, I don't think we have to select one to the excusion of the other. I > > don't think that because khugepaged may do periodic synchronous memory > > compaction (to eventually remove direct compaction entirely from the page > > fault path, since we have checks in the page allocator that specifically > > do that) > > That would be nice for the THP page faults, yes. Or maybe just change the > default for thp "defrag" tunable to "madvise". > Right, however I'm afraid that what we have done to compaction in the fault path for MIGRATE_ASYNC has been implicitly change that default in the code :) I have examples where async compaction in the fault path scans three pageblocks and gives up because of the abort heuristics, that's not suggesting that we'll be very successful. The hope is that we can change the default to "madvise" due to periodic and background compaction and then make the "always" case do some actual defrag :) > I think pushing compaction in a workqueue would meet a bigger resistance than > new kthreads. It could be too heavyweight for this mechanism and what if > there's suddenly lots of allocations in parallel failing and scheduling the > work items? So if we do it elsewhere, I think it's best as kcompactd kthreads > and then why would we do it also in khugepaged? > We'd need the aforementioned ratelimiting to ensure that background compaction is handled appropriately, absolutely. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/