Date: Tue, 11 Sep 2018 13:56:13 +0200
From: Michal Hocko
To: David Rientjes
Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, "Kirill A. Shutemov",
    linux-mm@kvack.org, LKML, Stefan Priebe
Subject: Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Message-ID: <20180911115613.GR10951@dhcp22.suse.cz>
References: <20180907130550.11885-1-mhocko@kernel.org>

On Mon 10-09-18 13:08:34, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
[...]
> > Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> > it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> > juggle gfp flags for different allocation modes. The rationale is that
> > __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> > to a different node might be more harmful than the benefit of a large
> > page. If the user really requires THP (e.g. by MADV_HUGEPAGE) then the
> > THP has a higher priority than local NUMA placement.
>
> That's not entirely true: the remote access latency for remote thp on all
> of our platforms is greater than for local small pages, and this is
> especially true for remote thp that is allocated intersocket and must be
> accessed through the interconnect.
>
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased
> allocation latency, but certainly not remote access latency. There are
> users who remap their text segment onto transparent hugepages and are
> fine with the startup delay if they access all of their text from local
> thp. Remote thp would be a significant performance degradation.

Well, it seems that expectations differ between users. The kvm users, for
example, do not really agree with your interpretation.

> When Andrea brought this up, I suggested that the full solution would be
> a MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the
> added benefit is that we could replace the thp "defrag" mode default by
> setting this as part of default_policy. Right now, MADV_HUGEPAGE users
> are concerned about (1) getting thp when system-wide it is not default
> and (2) additional fault latency when direct compaction is not default.
> They are not anticipating the degradation of remote access latency, so
> overloading the meaning of the mode is probably not a good idea.

A hugepage-specific MPOL flag sounds like yet another step towards an even
more cluttered API and semantics, I am afraid. Why should this be any
different from regular page allocations? You get off-node memory once your
local node is full, and you have to use an explicit binding to disallow
that. THP should be similar in that regard. Once you have said that you
_really_ want THP, you are closer to what we do for regular pages IMHO.

I do realize that this is a gray zone, because nobody bothered to define
the semantics when MADV_HUGEPAGE was introduced (a826e422420b4 is
exceptionally short on information). So we are left with more or less
undefined behavior and have to define it properly now. As we can see, this
might regress some workloads, but I strongly suspect that an explicit
binding is a more logical approach than a thp-specific mpol mode. If
anything, this should be a more generic memory policy, basically saying
that a zone/node reclaim mode should be enabled for the particular
allocation.
--
Michal Hocko
SUSE Labs
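
The "explicit binding" Michal refers to can be expressed from userspace
roughly as follows. This is a minimal illustrative sketch, not part of the
patch under discussion: it opts a mapping into THP with MADV_HUGEPAGE while
binding it to a single node with MPOL_BIND, so that neither huge nor
small-page allocations for that range spill onto a remote node. Node 0 and
the 64MB mapping size are arbitrary example values.

/*
 * Sketch: MADV_HUGEPAGE plus an explicit MPOL_BIND node binding.
 * Assumptions: node 0 exists and has enough free memory; the 64MB size
 * is only an example. Link with -lnuma for the mbind() wrapper.
 */
#include <numaif.h>     /* mbind, MPOL_BIND */
#include <sys/mman.h>   /* mmap, madvise, MADV_HUGEPAGE */
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 64UL << 20;            /* 64MB example mapping */
	unsigned long nodemask = 1UL << 0;  /* allow node 0 only */

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind the range to node 0; allocations will reclaim locally
	 * rather than fall back to another node. */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	/* Opt in to THP for this range. */
	if (madvise(p, len, MADV_HUGEPAGE)) {
		perror("madvise");
		return 1;
	}

	memset(p, 0, len);  /* fault the pages in under the policy above */
	return 0;
}

With such a binding in place, a THP allocation that cannot be satisfied on
the local node would fall back to local small pages (or reclaim) rather
than to remote memory, which is the behavior David describes his users
wanting; the trade-off is that the mapping can never use memory from other
nodes at all.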