Date: Wed, 12 Dec 2018 12:00:16 -0500
From: Andrea Arcangeli
To: Michal Hocko
Cc: David Rientjes, Linus Torvalds, mgorman@techsingularity.net,
 Vlastimil Babka, ying.huang@intel.com, s.priebe@profihost.ag,
 Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
 kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Message-ID: <20181212170016.GG1130@redhat.com>
References: <20181205233632.GE11899@redhat.com>
 <20181210044916.GC24097@redhat.com>
 <20181212095051.GO1286@dhcp22.suse.cz>
In-Reply-To: <20181212095051.GO1286@dhcp22.suse.cz>

On Wed, Dec 12, 2018 at 10:50:51AM +0100, Michal Hocko wrote:
> I can be convinced that larger pages really require a different
> behavior than base pages but you should better show _real_ numbers
> on a wider variety workloads to back your claims. I have only heard
> hand waving and

I agree with your point about node_reclaim, and I think David's
complaint of "I got remote THP instead of local 4k" with our proposed
fix is going to morph into "I got remote 4k instead of local 4k" with
his favorite fix. Because David's fix stops calling reclaim with
__GFP_THISNODE, the moment the node is full of pagecache the
node_reclaim behavior goes away and even 4k pages start to be
allocated remote (and because __GFP_THISNODE is still set in the THP
allocation, all readily available or trivial-to-compact remote THP
will be ignored too).

What David needs, I think, is a way to set __GFP_THISNODE for THP
*and 4k* allocations, and if both fail in a row with __GFP_THISNODE
set, to repeat the whole thing without __GFP_THISNODE set (ideally
with a nodemask to skip the node that we already scraped down to the
bottom during the initial __GFP_THISNODE pass); a rough sketch of
this two-pass idea is appended at the end of this mail. This way his
proprietary software binary will work even better than before when
the local node is fragmented, and he'll finally get the speedup from
remote THP too in case the local node is truly OOM while all other
nodes are full of readily available THP.

To achieve this without a new MADV_THISNODE/MADV_NODE_RECLAIM, we'd
need a way to start with __GFP_THISNODE, then draw the line in
reclaim and drop __GFP_THISNODE when too much pressure mounts in the
local node. But, like you said, that becomes node_reclaim again, and
it would be better if it could be done with an opt-in like
MADV_HUGEPAGE, because not all workloads would benefit from such
extra pagecache reclaim cost (just as not all workloads benefit from
synchronous compaction).

I think some NUMA reclaim mode semantics ended up embedded and hidden
in the THP MADV_HUGEPAGE, but they imposed a massive slowdown on all
workloads that can't cope with the node_reclaim-mode behavior because
they don't fit in a single node. Adding MADV_THISNODE/MADV_NODE_RECLAIM
will guarantee his proprietary software binary runs at maximum
performance without cache interference, and he's happy to accept the
risk of a massive slowdown in case the local node is truly OOM. The
fallback, despite being very inefficient, will still happen without
the OOM killer triggering.

Thanks,
Andrea
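
Purely as illustration (not a patch), the two-pass idea above could
look roughly like the sketch below. alloc_fault_page() and its call
site are made up for the example; only __GFP_THISNODE and
alloc_pages_node() are real interfaces, and the "skip the node we
already scraped" nodemask is left out for brevity.

#include <linux/gfp.h>		/* alloc_pages_node(), __GFP_THISNODE */
#include <linux/topology.h>	/* numa_node_id() */

/* Hypothetical helper: same policy for the 4k and THP fault paths. */
static struct page *alloc_fault_page(gfp_t gfp, unsigned int order)
{
	int nid = numa_node_id();
	struct page *page;

	/* Pass 1: THP and 4k both restricted to the local node. */
	page = alloc_pages_node(nid, gfp | __GFP_THISNODE, order);
	if (page)
		return page;

	/*
	 * Pass 2: the local node is truly short on memory, so allow
	 * remote fallback.  Ideally this pass would also carry a
	 * nodemask excluding nid, since pass 1 already scraped it
	 * down to the bottom.
	 */
	return alloc_pages_node(nid, gfp & ~__GFP_THISNODE, order);
}

The same helper would be called with order 0 for the base page path
and HPAGE_PMD_ORDER for the THP path, so both passes apply the same
local-first, then-remote policy.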