Date: Mon, 3 Dec 2018 14:23:44 -0500
From: Andrea Arcangeli
To: Michal Hocko
Cc: Linus Torvalds, ying.huang@intel.com, s.priebe@profihost.ag,
 mgorman@techsingularity.net, Linux List Kernel Mailing,
 alex.williamson@redhat.com, lkp@01.org, David Rientjes,
 kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu,
 Vlastimil Babka
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Message-ID: <20181203192344.GA2986@redhat.com>
References: <20181127205737.GI16136@redhat.com> <87tvk1yjkp.fsf@yhuang-dev.intel.com>
 <20181203181456.GK31738@dhcp22.suse.cz> <20181203183050.GL31738@dhcp22.suse.cz>
 <20181203185954.GM31738@dhcp22.suse.cz>
In-Reply-To: <20181203185954.GM31738@dhcp22.suse.cz>

On Mon, Dec 03, 2018 at 07:59:54PM +0100, Michal Hocko wrote:
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that. There
> are other reclaim related fixes under work right now. All I am saying
> is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> all the usecases and it is particularly KVM that suffers from it.

I'd like to clarify that it's not just KVM. We found it with KVM
because for KVM it's fairly common to create a VM that can't possibly
fit in a single node, while most other apps don't tend to allocate
that much memory.

It's trivial to reproduce the badness by running a memhog process that
allocates more than the RAM of one NUMA node under the defrag=always
setting (or by changing memhog to use MADV_HUGEPAGE). It'll create
swap storms even though 75% of the RAM is completely free on a 4 node
NUMA system (or 50% of the RAM is free on a 2 node NUMA system), and
so on.

How can it be ok to push the system into gigabytes of swap by default,
without any special capability, while 50% - 75% or more of the RAM is
free? That's the downside of the __GFP_THISNODE optimization.

__GFP_THISNODE helps increase NUMA locality if your app can fit in a
single node, which is the common case for David's workload. But if his
workload more often than not did not fit in a single node, he would
also run into an unacceptable slowdown because of __GFP_THISNODE.
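
For illustration only, the memhog-style reproducer described above could
look roughly like the sketch below. This is not the memhog source: the
function names hog_thp() and free_fraction_one_node_full() are made up
for this example, and the arithmetic assumes equally sized NUMA nodes,
which the text above implies but does not state.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* exposes MAP_ANONYMOUS and MADV_HUGEPAGE in glibc */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fraction of total RAM still free once one node is completely full,
 * assuming (hypothetically) that all nodes have the same size:
 * 0.75 for 4 nodes, 0.5 for 2 nodes.
 */
double free_fraction_one_node_full(int nr_nodes)
{
	return (double)(nr_nodes - 1) / (double)nr_nodes;
}

/*
 * Map `bytes` of anonymous memory, ask for THP with MADV_HUGEPAGE and
 * fault in every page. Returns 0 on success, -1 if the mapping fails.
 * A failing madvise() is reported but not treated as fatal.
 */
int hog_thp(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t off;

	if (page <= 0)
		page = 4096;	/* fallback guess, only used if sysconf fails */

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	if (madvise(p, bytes, MADV_HUGEPAGE) != 0)
		perror("madvise(MADV_HUGEPAGE)");

	/* Touch one byte per base page so the kernel has to back it all. */
	for (off = 0; off < bytes; off += (size_t)page)
		p[off] = 1;

	munmap(p, bytes);
	return 0;
}
```

Calling hog_thp() with a size larger than the RAM of one node is the
scenario in question. The sketch does not measure swap activity itself,
so that would have to be observed separately (for example with vmstat)
while it runs.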

I think there's lots of room for improvement in the future, but in my
view __GFP_THISNODE as it was implemented was an incomplete hack that
opened the door to bad VM corner cases that should not happen.

It would also be nice to have a reproducer for David's workload; the
software to run the binary on THP is not released either. We have
lots of reproducers for the corner case introduced by the
__GFP_THISNODE trick.

So this is basically a revert of the commit that made MADV_HUGEPAGE
with __GFP_THISNODE behave like a privileged (although not as static)
mbind.

I provided an alternative, but we weren't sure if that was the best
long term solution that could satisfy everyone, because it does have
some drawbacks too.

Thanks,
Andrea