Date: Tue, 4 Dec 2018 08:35:35 +0100
From: Michal Hocko
To: David Rientjes
Cc: Linus Torvalds, Andrea Arcangeli, ying.huang@intel.com, s.priebe@profihost.ag,
	mgorman@techsingularity.net, Linux List Kernel Mailing, alex.williamson@redhat.com,
	lkp@01.org, kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu,
	Vlastimil Babka
Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations
Message-ID: <20181204073535.GV31738@dhcp22.suse.cz>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 03-12-18 15:50:24, David Rientjes wrote:
> This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings")
> and a partial revert of 89c83fb539f9 ("mm, thp:
> consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
>
> By not setting __GFP_THISNODE, applications can allocate remote hugepages
> when the local node is fragmented or low on memory when either the thp
> defrag setting is "always" or the vma has been madvised with
> MADV_HUGEPAGE.
>
> Remote access to hugepages often has much higher latency than local pages
> of the native page size. On Haswell, ac5b2c18911f was shown to have a
> 13.9% access regression after this commit for binaries that remap their
> text segment to be backed by transparent hugepages.
>
> The intent of ac5b2c18911f is to address an issue where a local node is
> low on memory or fragmented such that a hugepage cannot be allocated. In
> every scenario where this was described as a fix, there is abundant and
> unfragmented remote memory available to allocate from, even with a greater
> access latency.
>
> If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> also measured on Haswell to have a 40% regression in allocation latency.
>
> Restore __GFP_THISNODE for thp allocations.
>
> Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")

At minimum do not remove the cleanup part which consolidates the gfp
handling to a single place. There is no real reason to have the
__GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.

I still hate the __GFP_THISNODE part as mentioned before. It is an ugly
hack but I can learn to live with it if this is indeed the only option
for the short term workaround until we find a proper solution.
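For reference, the mask selection that the reverted helper ends up with can
be sketched in user space as follows. This is only an illustration of the
control flow in alloc_hugepage_direct_gfpmask() after the revert; the flag
values and the defrag_mode enum here are made-up stand-ins, not the real
kernel constants:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel gfp flags; the bit values are
 * arbitrary for this sketch and do not match include/linux/gfp.h. */
#define GFP_TRANSHUGE_LIGHT   0x01u
#define __GFP_THISNODE        0x02u
#define __GFP_NORETRY         0x04u
#define __GFP_KSWAPD_RECLAIM  0x08u
#define __GFP_DIRECT_RECLAIM  0x10u
#define GFP_TRANSHUGE         (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

/* Hypothetical encoding of the THP defrag sysfs setting. */
enum defrag_mode {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/* Mirrors the flow of the patched helper: __GFP_THISNODE is set
 * unconditionally, and the defrag mode plus the per-VMA MADV_HUGEPAGE
 * hint decide the reclaim/compaction behaviour. */
static unsigned int hugepage_gfpmask(enum defrag_mode mode, bool vma_madvised)
{
	const unsigned int gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;

	switch (mode) {
	case DEFRAG_ALWAYS:	/* always do synchronous compaction */
		return GFP_TRANSHUGE | __GFP_THISNODE |
		       (vma_madvised ? 0 : __GFP_NORETRY);
	case DEFRAG_DEFER:	/* kick kcompactd and fail quickly */
		return gfp_mask | __GFP_KSWAPD_RECLAIM;
	case DEFRAG_DEFER_MADVISE: /* sync compaction if madvised, else kcompactd */
		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM
						: __GFP_KSWAPD_RECLAIM);
	case DEFRAG_MADVISE:	/* sync compaction only if madvised */
		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
	default:		/* "never": optimistic local attempt only */
		return gfp_mask;
	}
}
```

Every branch includes __GFP_THISNODE, which is exactly the point of the
revert: the allocation stays node-local and falls back to a base page
rather than reclaiming or compacting on a remote node.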

> Signed-off-by: David Rientjes
> ---
>  include/linux/mempolicy.h |  2 --
>  mm/huge_memory.c          | 42 +++++++++++++++------------------------
>  mm/mempolicy.c            |  7 ++++---
>  3 files changed, 20 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> -		unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> -	gfp_t this_node = 0;
> -
> -#ifdef CONFIG_NUMA
> -	struct mempolicy *pol;
> -	/*
> -	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> -	 * specified, to express a general desire to stay on the current
> -	 * node for optimistic allocation attempts. If the defrag mode
> -	 * and/or madvise hint requires the direct reclaim then we prefer
> -	 * to fallback to other node rather than node reclaim because that
> -	 * can lead to excessive reclaim even though there is free memory
> -	 * on other nodes. We expect that NUMA preferences are specified
> -	 * by memory policies.
> -	 */
> -	pol = get_vma_policy(vma, addr);
> -	if (pol->mode != MPOL_BIND)
> -		this_node = __GFP_THISNODE;
> -	mpol_cond_put(pol);
> -#endif
> +	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>
> +	/* Always do synchronous compaction */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | __GFP_THISNODE |
> +		       (vma_madvised ? 0 : __GFP_NORETRY);
> +
> +	/* Kick kcompactd and fail quickly */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> +		return gfp_mask | __GFP_KSWAPD_RECLAIM;
> +
> +	/* Synchronous compaction if madvised, otherwise kick kcompactd */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -					      __GFP_KSWAPD_RECLAIM | this_node);
> +		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +				   __GFP_KSWAPD_RECLAIM);
> +
> +	/* Only do synchronous compaction if madvised */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -					      this_node);
> -	return GFP_TRANSHUGE_LIGHT | this_node;
> +		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> +
> +	return gfp_mask;
>  }
>
>  /* Caller must hold page table lock.
>   */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1116,8 +1116,9 @@ static struct page *new_page(struct page *page, unsigned long start)
>  	} else if (PageTransHuge(page)) {
>  		struct page *thp;
>
> -		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
> -				address, numa_node_id());
> +		thp = alloc_pages_vma(GFP_TRANSHUGE | __GFP_THISNODE,
> +				HPAGE_PMD_ORDER, vma, address,
> +				numa_node_id());
>  		if (!thp)
>  			return NULL;
>  		prep_transhuge_page(thp);
> @@ -1662,7 +1663,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task. It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);

--
Michal Hocko
SUSE Labs