Date: Wed, 25 Feb 2015 13:24:28 -0800 (PST)
From: David Rientjes
To: Vlastimil Babka
cc: Andrew Morton, Greg Thelen, "Aneesh Kumar K.V", Linus Torvalds, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
In-Reply-To: <54EDA96C.4000609@suse.cz>

On Wed, 25 Feb 2015, Vlastimil Babka wrote:

> > Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
> > node") restructured alloc_hugepage_vma() with the intent of only
> > allocating transparent hugepages locally when there was not an effective
> > interleave mempolicy.
> >
> > alloc_pages_exact_node() does not limit the allocation to the single
> > node, however, but rather prefers it.  This is because __GFP_THISNODE is
> > not set which would cause the node-local nodemask to be passed.  Without
> > it, only a nodemask that prefers the local node is passed.
>
> Oops, good catch.
> But I believe we have the same problem with khugepaged_alloc_page(), rendering
> the recent node determination and zone_reclaim strictness patches partially
> useless.
>

Indeed.

> Then I start to wonder about other alloc_pages_exact_node() users. Some do
> pass __GFP_THISNODE, others not - are they also mistaken?
> I guess the function
> is a misnomer - when I see "exact_node", I expect the __GFP_THISNODE behavior.
>

I looked through these yesterday as well and could only find the
do_migrate_pages() case for page migration where __GFP_THISNODE was
missing.  I proposed that separately as
http://marc.info/?l=linux-mm&m=142481989722497 -- I couldn't find any
other users that looked wrong.

> I think to avoid such hidden catches, we should create
> alloc_pages_preferred_node() variant, change the exact_node() variant to pass
> __GFP_THISNODE, and audit and adjust all callers accordingly.
>

Sounds like that should be done as part of a cleanup after the 4.0 issues
are addressed.  alloc_pages_exact_node() does seem to suggest that we want
exactly that node, implying __GFP_THISNODE behavior already, so it would
be good to avoid having this come up again in the future.

> Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
> already. Of course, nothing guarantees that hugepage == true implies that gfp
> == GFP_TRANSHUGE... but current in-tree callers conform to that.
>

Ah, good point, and it includes __GFP_NORETRY as well which means that
this patch is busted.  It won't try compaction or direct reclaim in the
page allocator slowpath because of this:

	/*
	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
	 * using a larger set of nodes after it has established that the
	 * allowed per node queues are empty and that nodes are
	 * over allocated.
	 */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;

Hmm.  It would be disappointing to have to pass the nodemask of the exact
node that we want to allocate from into the page allocator to avoid using
__GFP_THISNODE.
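
To make the mask test quoted above concrete, here is a minimal userspace
sketch of the same comparison.  It is only an illustration: the SK_* bit
values are made up and are not the kernel's definitions, SK_GFP_TRANSHUGE
carries only the bits that matter for this check, and skips_reclaim() is
a hypothetical helper name, not a kernel function.

```c
#include <stdbool.h>

/*
 * Hypothetical bit values chosen for this sketch only.  The real flag
 * definitions live in include/linux/gfp.h and use different values.
 */
#define SK_GFP_WAIT      0x01u
#define SK_GFP_NORETRY   0x02u
#define SK_GFP_NOWARN    0x04u
#define SK_GFP_THISNODE  0x08u

/* Composite described in the quoted comment: THISNODE + NORETRY + NOWARN. */
#define SK_GFP_THISNODE_COMPOSITE \
	(SK_GFP_THISNODE | SK_GFP_NORETRY | SK_GFP_NOWARN)

/* Stand-in for GFP_TRANSHUGE, reduced to the bits relevant here (assumption). */
#define SK_GFP_TRANSHUGE \
	(SK_GFP_WAIT | SK_GFP_NORETRY | SK_GFP_NOWARN)

/*
 * Same shape as the slowpath test: true only when every bit of the
 * composite is present in gfp_mask, no matter who set each bit.
 */
static bool skips_reclaim(unsigned int gfp_mask)
{
	return (gfp_mask & SK_GFP_THISNODE_COMPOSITE) ==
	       SK_GFP_THISNODE_COMPOSITE;
}
```

With these stand-in values, skips_reclaim(SK_GFP_TRANSHUGE | SK_GFP_THISNODE)
is true, because the NORETRY and NOWARN bits contributed by the hugepage
mask complete the composite once THISNODE is added.  The same mask with
SK_GFP_NORETRY cleared makes it false, and SK_GFP_TRANSHUGE on its own is
false as well.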

There's a sneaky way around it by just removing __GFP_NORETRY from
GFP_TRANSHUGE so the condition above fails and since the page allocator
won't retry for such a high-order allocation, but that probably just
papers over this stuff too much already.  I think what we want to do is
cause the slab allocators to not use __GFP_WAIT if they want to avoid
reclaim.

This is probably going to be a much more invasive patch than originally
thought.