Date: Sun, 9 Dec 2018 14:44:23 -0800 (PST)
From: David Rientjes
To: Andrea Arcangeli
Cc: Michal Hocko, Vlastimil Babka, Linus Torvalds, ying.huang@intel.com,
    s.priebe@profihost.ag, mgorman@techsingularity.net,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
 regressions
In-Reply-To: <20181206003126.GA21159@redhat.com>
References: <20181205090554.GX1286@dhcp22.suse.cz>
 <20181205214542.GC11899@redhat.com> <20181206003126.GA21159@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > I've must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.1811281504030.231719@chino.kir.corp.google.com
>
> you said "On a fragmented host, the change itself showed a 13.9%
> access latency regression on Haswell and up to 40% allocation latency
>                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> regression. This is more substantial on Naples and Rome. I also
> ^^^^^^^^^^
> measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> > users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to
> mention in the above email that the "40% allocation/fault latency
> regression" you reported above, is actually a secondary concern because
> those must be long lived allocations and we can't yet generate
> compound pages for free after all..

I've been referring to the long history of this discussion, namely my
explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775
two months ago citing the 13.9% access latency regression.  The patch was
nonetheless still merged, I proposed the revert for the same chief
complaint, and it was reverted.  I brought up the access latency issue
three months ago in https://marc.info/?l=linux-kernel&m=153661012118046
and said allocation latency was a secondary concern, specifically that
our users of MADV_HUGEPAGE are willing to accept the increased allocation
latency for local hugepages.

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing
> in your benchmarks? NUMA balancing would fix the access latency very
> easily too, so that 13.9% access latency must quickly disappear if you
> correctly have NUMA balancing enabled in a NUMA system.

No, we do not have CONFIG_NUMA_BALANCING enabled.  The __GFP_THISNODE
behavior for hugepages was added in 4.0 for the PPC usecase, not by me.
That had nothing to do with the madvise mode: the initial documentation
referred to the mode as a way to prevent an increase in rss for configs
where "enabled" was set to madvise.
The allocation policy was never about MADV_HUGEPAGE in any 4.x kernel;
it was only an indication, for certain defrag settings, of how much work
should be done to allocate *local* hugepages at fault.  If you are saying
that the change in allocator policy in a patch from Aneesh almost four
years ago has gone unreported by anybody up until a few months ago, I can
understand the frustration.  I do, however, support the __GFP_THISNODE
change he made because his data shows the same results as mine.

I've suggested a very simple extension, specifically a prctl() mode that
is inherited across fork, that would allow a workload to specify that it
prefers remote allocations over local compaction/reclaim because it is
too large to fit on a single node.  I'd value your feedback on that
suggestion to fix your usecase.