Date: Tue, 4 Dec 2018 14:25:54 -0800 (PST)
From: David Rientjes
To: Michal Hocko
cc: Linus Torvalds, Andrea Arcangeli, ying.huang@intel.com,
    s.priebe@profihost.ag, mgorman@techsingularity.net,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu,
    Vlastimil Babka
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
    regressions
In-Reply-To: <20181204073850.GW31738@dhcp22.suse.cz>

On Tue, 4 Dec 2018, Michal Hocko wrote:

> > This fixes a 13.9% of remote memory access regression and 40% remote
> > memory allocation regression on Haswell when the local node is fragmented
> > for hugepage sized pages and memory is being faulted with either the thp
> > defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> >
> > The usecase that initially identified this issue were binaries that mremap
> > their .text segment to be backed by transparent hugepages on startup.
> > They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> Do you have something you can share with so that other people can play
> and try to reproduce?
>

This is a single MADV_HUGEPAGE usecase; there is nothing special about it.
It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
faulted the memory with a fragmented local node and then measured the
remote access latency to the remote hugepage that occurs without setting
__GFP_THISNODE.  You can also measure the remote allocation latency by
fragmenting the entire system and then faulting.

(Remapping the text segment only involves parsing /proc/self/exe, mmap,
madvise, memcpy, and mremap.)

> > This requires a full revert and partial revert of commits merged during
> > the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> > __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> > amounts of swap activity on the local zone when faulting hugepages by
> > falling back to remote memory.  This remote allocation causes the access
> > regression and, if fragmented, the allocation regression.
>
> Have you tried to measure any of the workloads Mel and Andrea have
> pointed out during the previous review discussion? In other words what
> is the impact on the THP success rate and allocation latencies for other
> usecases?

It isn't a property of the workload, it's a property of how fragmented
both local and remote memory are.  In Andrea's case, I believe he has
stated that memory compaction has failed locally and the resulting reclaim
activity ends up looping and causing it to thrash the local node whereas
75% of remote memory is free and not fragmented.  So we have local
fragmentation and reclaim is very expensive to enable compaction to
succeed, if it ever does succeed[*], and mostly free remote memory.
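
For anyone who wants something concrete to start from, the mmap(),
madvise(MADV_HUGEPAGE), and fault sequence described above can be sketched
as below.  This is only an illustration under stated assumptions, not the
test that produced the numbers in this thread: fault_thp_region() and
HPAGE_SIZE are hypothetical names, a 2 MiB hugepage size is assumed, and
the sketch does nothing to fragment the local node or to measure latency.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS and MADV_HUGEPAGE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 2 MiB transparent hugepages, as on x86-64. */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper (not from the thread): map nr_hpages worth of
 * hugepage-aligned anonymous memory, advise MADV_HUGEPAGE on it, then
 * touch every byte so the hugepage fault path runs.
 *
 * *advised is set to 1 if madvise() accepted the hint, 0 otherwise
 * (for example on a kernel built without THP support).
 * Returns the aligned region, to be released with
 * munmap(ptr, nr_hpages * HPAGE_SIZE), or NULL if mmap() failed.
 */
void *fault_thp_region(size_t nr_hpages, int *advised)
{
	size_t len = nr_hpages * HPAGE_SIZE;
	size_t map_len = len + HPAGE_SIZE;	/* slack so we can align */
	size_t head, tail;
	char *raw, *aligned;

	raw = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* Round up to the next hugepage boundary inside the mapping. */
	aligned = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
			   ~((uintptr_t)HPAGE_SIZE - 1));
	assert((uintptr_t)aligned % HPAGE_SIZE == 0);

	/* Trim the unaligned head and tail so only len bytes stay mapped. */
	head = (size_t)(aligned - raw);
	tail = map_len - head - len;
	if (head)
		munmap(raw, head);
	if (tail)
		munmap(aligned + len, tail);

	/* The hint is advisory; a failure here is reported, not fatal. */
	*advised = (madvise(aligned, len, MADV_HUGEPAGE) == 0);

	/* First touch: this is where hugepage allocation is attempted. */
	memset(aligned, 0x5a, len);
	return aligned;
}
```

Whether the faults are actually backed by hugepages, and on which node,
depends on the state of the system and on the settings under
/sys/kernel/mm/transparent_hugepage/, so a real reproduction would still
need to fragment the local node first and time the faults and subsequent
accesses.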

If remote memory is also fragmented, Andrea's case will run into a much
more severe swap storm as a result of not setting __GFP_THISNODE.  The
premise of the entire change is that his remote memory is mostly free so
fallback results in a quick allocation.  For balanced nodes, that's not
going to be the case.  The fix to prevent the heavy reclaim activity is to
set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.
That's an interesting memory state to

 [*] Reclaim here would only be beneficial if we fail the order-0
     watermark check in __compaction_suitable() *and* the reclaimed
     memory can be accessed during isolate_freepages().