Date:   Wed, 5 Dec 2018 14:10:47 -0800 (PST)
From:   David Rientjes <rientjes@google.com>
To:     Andrea Arcangeli <aarcange@redhat.com>
cc:     Michal Hocko <mhocko@kernel.org>, Vlastimil Babka <vbabka@suse.cz>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        ying.huang@intel.com, s.priebe@profihost.ag,
        mgorman@techsingularity.net,
        Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
        alex.williamson@redhat.com, lkp@01.org, kirill@shutemov.name,
        Andrew Morton <akpm@linux-foundation.org>,
        zi.yan@cs.rutgers.edu
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
 regressions
In-Reply-To: <20181205214542.GC11899@redhat.com>
Message-ID: <alpine.DEB.2.21.1812051402150.9633@chino.kir.corp.google.com>
References: <alpine.DEB.2.21.1812031545080.161134@chino.kir.corp.google.com> <bb198d88-27be-0d5c-d871-1ffd26a08e29@suse.cz> <alpine.DEB.2.21.1812041356490.157466@chino.kir.corp.google.com> <20181205090554.GX1286@dhcp22.suse.cz>
 <alpine.DEB.2.21.1812051142040.240991@chino.kir.corp.google.com> <20181205214542.GC11899@redhat.com>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > High thp utilization is not always better, especially when those hugepages 
> > are accessed remotely and introduce the regressions that I've reported.  
> > Seeking high thp utilization at all costs is not the goal if it causes 
> > workloads to regress.
> 
> Is it possible what you need is a defrag=compactonly_thisnode to set
> instead of the default defrag=madvise? The fact you seem concerned
> about page fault latencies doesn't make your workload an obvious
> candidate for MADV_HUGEPAGE to begin with. At least unless you decide
> to smooth the MADV_HUGEPAGE behavior with an mbind that will simply
> add __GFP_THISNODE to the allocations, perhaps you'll be even faster
> if you invoke reclaim in the local node for 4k allocations too.
> 

I must have said this at least six or seven times: fault latency is 
secondary to the *access* latency.  We want to try hard for MADV_HUGEPAGE 
users to do synchronous compaction and try to make a hugepage available.  
We really want to be backed by hugepages, but certainly not when the 
access latency becomes 13.9% worse as a result compared to local pages of 
the native page size.

This is not a system-wide configuration detail; it is specific to the 
workload: does it span more than one node or not?  No workload that can 
fit into a single node, which you also say is going to be the majority of 
workloads on today's platforms, is going to want to revert the 
__GFP_THISNODE behavior of almost the past four years.  It makes perfect 
sense, however, for remote allocation to be a new mempolicy mode, a new 
madvise mode, or a prctl.

> It looks like for your workload THP is a nice to have add-on, which is
> practically true of all workloads (with a few corner cases that must
> use MADV_NOHUGEPAGE), and it's what the defrag= default is about.
> 
> Is it possible that you just don't want to shut off completely
> compaction in the page fault and if you're ok to do it for your
> library, you may be ok with that for all other apps too?
> 

We enable synchronous compaction for MADV_HUGEPAGE users, yes, because we 
are not concerned with the fault latency but rather the access latency.

> That's a different stance from other MADV_HUGEPAGE users because you
> don't seem to mind a severely crippled THP utilization in your
> app.
> 

If access latency is really better for local pages of the native page 
size, we of course want to fault those instead.  For almost the past four 
years, the behavior of MADV_HUGEPAGE has been to compact and possibly 
reclaim locally and then fall back to local pages.  It is exactly what our 
users of MADV_HUGEPAGE want; I did not introduce this NUMA locality 
restriction but our users have used it.
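
To make the ordering concrete, here is a toy model of that fallback.  It is 
a sketch only, not kernel code: struct local_node, its fields, and 
fault_madv_hugepage() are all hypothetical names invented for illustration.  
The one point it encodes is that a remote hugepage is never an outcome.

```c
#include <stdbool.h>

/* What ends up backing the faulting range in this toy model. */
enum backing {
	LOCAL_HUGEPAGE,
	LOCAL_NATIVE_PAGES,
};

/* Hypothetical summary of the local node's state at fault time. */
struct local_node {
	bool hugepage_free;		/* a free hugepage already exists */
	bool compaction_can_succeed;	/* local compaction/reclaim can form one */
};

/*
 * Model of the MADV_HUGEPAGE fault path as described above: use a local
 * hugepage if one is free, otherwise compact (and possibly reclaim)
 * locally, otherwise fall back to local pages of the native page size.
 */
static enum backing fault_madv_hugepage(const struct local_node *n)
{
	if (n->hugepage_free)
		return LOCAL_HUGEPAGE;
	if (n->compaction_can_succeed)
		return LOCAL_HUGEPAGE;
	return LOCAL_NATIVE_PAGES;	/* no remote hugepage is considered */
}
```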

Please: if we wish to change behavior from February 2015, let's extend the 
API to allow for remote allocations in several of the ways we have already 
brainstormed rather than cause regressions.