Date: Wed, 5 Dec 2018 08:40:52 +0100
From: Michal Hocko
To: David Rientjes
Cc: Linus Torvalds, Andrea Arcangeli, ying.huang@intel.com,
	s.priebe@profihost.ag, mgorman@techsingularity.net,
	Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
	kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu,
	Vlastimil Babka
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
Message-ID: <20181205074052.GU1286@dhcp22.suse.cz>

On Tue 04-12-18 14:25:54, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
>
> > > This fixes a 13.9% remote memory access regression and a 40% remote
> > > memory allocation regression on Haswell when the local node is
> > > fragmented for hugepage-sized pages and memory is faulted with
> > > either the thp defrag setting of "always" or after being madvised
> > > with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue was binaries that
> > > mremap their .text segment to be backed by transparent hugepages on
> > > startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and
> > > mremap().
> >
> > Do you have something you can share so that other people can play
> > with it and try to reproduce?
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about
> it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE),
> and faulted the memory with a fragmented local node, and then measured
> the remote access latency to the remote hugepage that occurs without
> setting __GFP_THISNODE. You can also measure the remote allocation
> latency by fragmenting the entire system and then faulting.
>
> (Remapping the text segment only involves parsing /proc/self/exe,
> mmap, madvise, memcpy, and mremap.)

How does this reflect your real workload and the regressions in it? It
certainly shows the worst case behavior, where the access penalty is
prevalent, while there are no other metrics which might be interesting
or even important, e.g. page table savings or TLB pressure in general
when THP fail too eagerly. As Andrea mentioned, there are really valid
cases where the remote latency pays off. Have you actually seen the
advertised regression in a real workload, and do you have any means to
simulate that workload?

> > > This requires a full revert and partial revert of commits merged
> > > during the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm:
> > > thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), was
> > > anticipated to fix large amounts of swap activity on the local zone
> > > when faulting hugepages by falling back to remote memory. This
> > > remote allocation causes the access regression and, if fragmented,
> > > the allocation regression.
> >
> > Have you tried to measure any of the workloads Mel and Andrea have
> > pointed out during the previous review discussion? In other words,
> > what is the impact on the THP success rate and allocation latencies
> > for other usecases?
>
> It isn't a property of the workload, it's a property of how fragmented
> both local and remote memory are. In Andrea's case, I believe he has
> stated that memory compaction has failed locally and the resulting
> reclaim activity ends up looping and causing it to thrash the local
> node, whereas 75% of remote memory is free and not fragmented. So we
> have local fragmentation, and reclaim is very expensive to enable
> compaction to succeed, if it ever does succeed[*], and mostly free
> remote memory.
>
> If remote memory is also fragmented, Andrea's case will run into a
> much more severe swap storm as a result of not setting __GFP_THISNODE.
> The premise of the entire change is that his remote memory is mostly
> free, so fallback results in a quick allocation. For balanced nodes,
> that's not going to be the case. The fix to prevent the heavy reclaim
> activity is to set __GFP_NORETRY as the page allocator suspects, which
> patch 2 here does.
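For the record, the reproduction you describe amounts to roughly the
following. This is a minimal, untested sketch; the 1GB size, the timing
method and the read stride are my assumptions, and fragmenting the
local node beforehand (plus verifying placement via
/proc/self/numa_maps) is left out entirely:

/*
 * Minimal sketch of the reproduction described above: the size,
 * stride and timing are illustrative assumptions; fragmenting the
 * node and checking where the pages actually landed are not done
 * here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (512UL << 21)	/* 1GB in 2MB hugepage units (assumed size) */

static double since(struct timespec *t0)
{
	struct timespec t1;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) / 1e9;
}

int main(void)
{
	struct timespec t0;
	size_t i;
	char *p;

	p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, SZ, MADV_HUGEPAGE))	/* ask for THP backing */
		perror("madvise");

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(p, 1, SZ);			/* fault everything in */
	printf("fault:  %.3fs\n", since(&t0));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < SZ; i += 64)		/* one read per cache line */
		asm volatile("" : : "r" (p[i]));
	printf("access: %.3fs\n", since(&t0));

	return 0;
}

Run pinned to the fragmented node (e.g. under numactl --cpunodebind=0)
on kernels with and without the __GFP_THISNODE change, the second
timing measures exactly the worst case access penalty and nothing more.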
You are not answering my question, and I take it that you haven't
actually tested this on a variety of workloads and base your
assumptions on an artificial worst case scenario. Please correct me if
I am wrong.
-- 
Michal Hocko
SUSE Labs