Date: Wed, 5 Dec 2018 10:05:54 +0100
From: Michal Hocko
To: David Rientjes
Cc: Vlastimil Babka, Linus Torvalds, Andrea Arcangeli, ying.huang@intel.com,
    s.priebe@profihost.ag, mgorman@techsingularity.net,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
Message-ID: <20181205090554.GX1286@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 04-12-18 14:04:10, David Rientjes wrote:
> On Tue, 4 Dec 2018, Vlastimil Babka wrote:
>
> > So, AFAIK, the situation is:
> >
> > - commit
> > 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
> > intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
> > both as it seemed to make sense).
>
> Yes, both are based on the preference to fault local thp and fall back to
> local pages before allocating remotely, because that avoids the
> performance regression introduced by not setting __GFP_THISNODE.
>
> > - The resulting node-reclaim-like behavior regressed Andrea's KVM
> > workloads, but reverting it (only for madvised or non-default
> > defrag=always THP by commit ac5b2c18911f) would regress David's
> > workloads starting with 4.20 to pre-4.1 levels.
> >
> Almost, but the defrag=always case had the subtle difference of also
> setting __GFP_NORETRY whereas MADV_HUGEPAGE did not. This was different
> from the comment in __alloc_pages_slowpath() that expected thp fault
> allocations to be caught by checking __GFP_NORETRY.
>
> > If the decision is that it's too late to revert a 4.1 regression for one
> > kind of workload in 4.20 because it causes a regression for another
> > workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> > and don't rush a different fix (patch 2) into 4.20. It's not a big
> > difference if a 4.1 regression is fixed in 4.20 or 4.21?
> >
> The revert is certainly needed to prevent the regression, yes, but I
> anticipate that Andrea will report back that patch 2 at least improves the
> situation for the problem he was addressing, specifically that it is
> pointless to thrash any node or reclaim unnecessarily when compaction has
> already failed. This is what setting __GFP_NORETRY for all thp fault
> allocations fixes.

Yes, but the earlier numbers from Mel, repeated again in [1], simply show
that the swap storms are only avoided at the cost of an absolute drop in
the THP success rate.

> > Because there might be other unexpected consequences of patch 2 that
> > testing won't be able to catch in the remaining 4.20 rc's. And I'm not
And I'm not > > even sure if it will fix Andrea's workloads. While it should prevent > > node-reclaim-like thrashing, it will still mean that KVM (or anyone) > > won't be able to allocate THP's remotely, even if the local node is > > exhausted of both huge and base pages. > > > > Patch 2 does nothing with respect to the remote allocation policy; it > simply prevents reclaim (and potentially thrashing). Patch 1 sets > __GFP_THISNODE to prevent the remote allocation. Yes, this is understood. So we are getting worst of both. We have a numa locality side effect of MADV_HUGEPAGE and we have a poor THP utilization. So how come this is an improvement. Especially when the reported regression hasn't been demonstrated on a real or repeatable workload but rather a very vague presumably worst case behavior where the access penalty is absolutely prevailing. [1] http://lkml.kernel.org/r/20181204104558.GV23260@techsingularity.net -- Michal Hocko SUSE Labs