Date: Wed, 4 Sep 2019 12:54:18 -0700 (PDT)
From: David Rientjes
To: Linus Torvalds, Andrew Morton
Cc: Andrea Arcangeli, Michal Hocko, Mel Gorman, Vlastimil Babka,
    "Kirill A. Shutemov", linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch for-5.3 1/4] Revert "Revert "mm, thp: restore node-local hugepage allocations""

This reverts commit a8282608c88e08b1782141026eab61204c1e533f.
The commit references the original intended semantics of MADV_HUGEPAGE,
which have subsequently taken on three distinct purposes:

 - enables or disables thp for a range of memory depending on the system's
   config (is thp "enabled" set to "always" or "madvise"),

 - determines the synchronous compaction behavior for thp allocations at
   fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
   and

 - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
   clear previous hugepage advice).

These are the three purposes that exist in 5.2 and that userspace has been
written around over the past several years.  Adding a NUMA locality
preference adds a fourth dimension to an already conflated advice mode.

Based on the semantics that MADV_HUGEPAGE has provided over the past
several years, there exist workloads that rely on one principle in
particular: the allocation should attempt to defragment the local node
before falling back to remote memory.  It is agreed that remote hugepages
typically (but not always) have better access latency than remote native
pages, although on Naples the two are at parity for intersocket accesses.

The commit that this patch reverts allows hugepage allocations to fall
back to a remote node immediately when local memory is fragmented.  This
is contrary to the long-standing semantics of MADV_HUGEPAGE: memory
compaction should be attempted locally before falling back.  On Rome, for
example, remote hugepages suffer a 53.5% increase in access latency over
local hugepages.  For this reason, the goal is to restore the behavior of
5.2 and earlier, which attempts local defragmentation before falling back.

With the commit that this patch reverts applied, we see performance
degradations at the tail because the allocator happily allocates the
remote hugepage rather than even attempting to make a local hugepage
available.  zone_reclaim_mode is not a solution to this problem since it
does not only impact hugepage allocations but rather changes the memory
allocation strategy for *all* page allocations.

Signed-off-by: David Rientjes
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c          | 42 +++++++++++++++------------------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -648,37 +648,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
+	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_THISNODE |
+		       (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+						  __GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return gfp_mask;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1734,7 +1734,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
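
For readers who have not used the advice mode under discussion, what
follows is an illustrative userspace sketch (not part of the patch) of
MADV_HUGEPAGE, whose node-local allocation semantics this series
preserves.  The mapping size and error handling are arbitrary choices for
the example.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 256UL << 20;	/* 256MB region, arbitrary */
	void *p;

	/*
	 * Anonymous mapping; thp eligibility also depends on the system
	 * config, i.e. /sys/kernel/mm/transparent_hugepage/enabled being
	 * "always" or "madvise".
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * The advice with the three conflated purposes described above:
	 * opts the range into thp when "enabled" is "madvise", selects
	 * synchronous compaction at fault per the "defrag" setting, and
	 * clears any prior MADV_NOHUGEPAGE.
	 */
	if (madvise(p, len, MADV_HUGEPAGE)) {
		perror("madvise(MADV_HUGEPAGE)");
		return 1;
	}

	/*
	 * Faulting the range in is where the mask built by
	 * alloc_hugepage_direct_gfpmask() applies; with this revert the
	 * kernel attempts a node-local hugepage (compacting locally if
	 * needed) before falling back.
	 */
	memset(p, 0, len);

	munmap(p, len);
	return 0;
}

Whether the range actually ended up backed by hugepages can be checked
via the AnonHugePages field in /proc/<pid>/smaps.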