Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp368743ybn; Tue, 1 Oct 2019 22:46:20 -0700 (PDT) X-Google-Smtp-Source: APXvYqxAbu6bMiDkrRSbDVGXVEruzXrd+zw9qt/aN1gYDGzRlo/kWPoDutqytfvXpnt+rScpJD/P X-Received: by 2002:a05:6402:1583:: with SMTP id c3mr1958443edv.286.1569995179911; Tue, 01 Oct 2019 22:46:19 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1569995179; cv=none; d=google.com; s=arc-20160816; b=ycu/hcwwMKliOkK3H59tup1DEIhHxaXWthpzpALoKdU2g9YhqnNe+VwyP5aAXFvHXn j7RXhfVnynmXtgqvLmOrBYwQcKiJ6fKdRwvGP6Mgb9RP+if8Wpu3eHUHnHa7eFOEiHpx t9xLq3CMaKsdqc5s6P7n6IYXvjiALYJu4pEZre2xOnLg24qVQgN5wyZZystQhJfM1r8t puFeag265V44O4AQHCI+IsKXqISvxcs0NalFtItfxrf626WUrJhiqST2//FMPctVNHl8 7UEjgYhfaWU+3dsQ7qJO7g86oXjWrfboiaxubpaBfDMT2zVMNMiTU2GfxqExsscRdm15 mtSg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:autocrypt:from:references:cc:to:subject; bh=5p0XaDz2A604jyWbEnCdA22UJLaP3J9X5HkQ5tGZqrA=; b=zlXrraKoQV5pbLSgjg/RfdLwm6wWnlLDM5M/HNfueFmzMqhMInCWli4FB2ZbVanwfN slmTtqpMqu+2/jpMsJ5IJw3uP/q64sTdTR6zUCS9X/qA9fXVn/hQxboK3uO59xIZjPKM GMEFlAvoTac3CktvspRl95NuQaQI6DeeP0MJ2RgGSX2HxXrzZgyNZ6I7GAj/t3A6Nzjp nRk8mNw/abGLh3/DjJmywvVAZZOUSjbfc8bVZRrkSeiY7kthPVvi9vXONUAWrTgpNzzf HMC+2E0mUgbZnAsTOKjS6AWaeQRzmYtKXSrtaWocnhiA8IiaFpP39XB2SvYAwhD64bsF 4aIg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id va9si2790567ejb.126.2019.10.01.22.45.54; Tue, 01 Oct 2019 22:46:19 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727425AbfJAV6Q (ORCPT + 99 others); Tue, 1 Oct 2019 17:58:16 -0400 Received: from mx2.suse.de ([195.135.220.15]:38854 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725865AbfJAV6P (ORCPT ); Tue, 1 Oct 2019 17:58:15 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id BE58FAD3E; Tue, 1 Oct 2019 21:58:12 +0000 (UTC) Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages To: David Rientjes Cc: Michal Hocko , Linus Torvalds , Andrea Arcangeli , Andrew Morton , Mel Gorman , "Kirill A. Shutemov" , Linux Kernel Mailing List , Linux-MM References: <20190904205522.GA9871@redhat.com> <20190909193020.GD2063@dhcp22.suse.cz> <20190925070817.GH23050@dhcp22.suse.cz> <20190927074803.GB26848@dhcp22.suse.cz> <20190930112817.GC15942@dhcp22.suse.cz> <20191001054343.GA15624@dhcp22.suse.cz> From: Vlastimil Babka Autocrypt: addr=vbabka@suse.cz; prefer-encrypt=mutual; keydata= mQINBFZdmxYBEADsw/SiUSjB0dM+vSh95UkgcHjzEVBlby/Fg+g42O7LAEkCYXi/vvq31JTB KxRWDHX0R2tgpFDXHnzZcQywawu8eSq0LxzxFNYMvtB7sV1pxYwej2qx9B75qW2plBs+7+YB 87tMFA+u+L4Z5xAzIimfLD5EKC56kJ1CsXlM8S/LHcmdD9Ctkn3trYDNnat0eoAcfPIP2OZ+ 9oe9IF/R28zmh0ifLXyJQQz5ofdj4bPf8ecEW0rhcqHfTD8k4yK0xxt3xW+6Exqp9n9bydiy tcSAw/TahjW6yrA+6JhSBv1v2tIm+itQc073zjSX8OFL51qQVzRFr7H2UQG33lw2QrvHRXqD Ot7ViKam7v0Ho9wEWiQOOZlHItOOXFphWb2yq3nzrKe45oWoSgkxKb97MVsQ+q2SYjJRBBH4 8qKhphADYxkIP6yut/eaj9ImvRUZZRi0DTc8xfnvHGTjKbJzC2xpFcY0DQbZzuwsIZ8OPJCc LM4S7mT25NE5kUTG/TKQCk922vRdGVMoLA7dIQrgXnRXtyT61sg8PG4wcfOnuWf8577aXP1x 6mzw3/jh3F+oSBHb/GcLC7mvWreJifUL2gEdssGfXhGWBo6zLS3qhgtwjay0Jl+kza1lo+Cv BB2T79D4WGdDuVa4eOrQ02TxqGN7G0Biz5ZLRSFzQSQwLn8fbwARAQABtCBWbGFzdGltaWwg QmFia2EgPHZiYWJrYUBzdXNlLmN6PokCVAQTAQoAPgIbAwULCQgHAwUVCgkICwUWAgMBAAIe AQIXgBYhBKlA1DSZLC6OmRA9UCJPp+fMgqZkBQJcbbyGBQkH8VTqAAoJECJPp+fMgqZkpGoP /1jhVihakxw1d67kFhPgjWrbzaeAYOJu7Oi79D8BL8Vr5dmNPygbpGpJaCHACWp+10KXj9yz fWABs01KMHnZsAIUytVsQv35DMMDzgwVmnoEIRBhisMYOQlH2bBn/dqBjtnhs7zTL4xtqEcF 1hoUFEByMOey7gm79utTk09hQE/Zo2x0Ikk98sSIKBETDCl4mkRVRlxPFl4O/w8dSaE4eczH LrKezaFiZOv6S1MUKVKzHInonrCqCNbXAHIeZa3JcXCYj1wWAjOt9R3NqcWsBGjFbkgoKMGD usiGabetmQjXNlVzyOYdAdrbpVRNVnaL91sB2j8LRD74snKsV0Wzwt90YHxDQ5z3M75YoIdl byTKu3BUuqZxkQ/emEuxZ7aRJ1Zw7cKo/IVqjWaQ1SSBDbZ8FAUPpHJxLdGxPRN8Pfw8blKY 8mvLJKoF6i9T6+EmlyzxqzOFhcc4X5ig5uQoOjTIq6zhLO+nqVZvUDd2Kz9LMOCYb516cwS/ Enpi0TcZ5ZobtLqEaL4rupjcJG418HFQ1qxC95u5FfNki+YTmu6ZLXy+1/9BDsPuZBOKYpUm 3HWSnCS8J5Ny4SSwfYPH/JrtberWTcCP/8BHmoSpS/3oL3RxrZRRVnPHFzQC6L1oKvIuyXYF rkybPXYbmNHN+jTD3X8nRqo+4Qhmu6SHi3VquQENBFsZNQwBCACuowprHNSHhPBKxaBX7qOv KAGCmAVhK0eleElKy0sCkFghTenu1sA9AV4okL84qZ9gzaEoVkgbIbDgRbKY2MGvgKxXm+kY n8tmCejKoeyVcn9Xs0K5aUZiDz4Ll9VPTiXdf8YcjDgeP6/l4kHb4uSW4Aa9ds0xgt0gP1Xb AMwBlK19YvTDZV5u3YVoGkZhspfQqLLtBKSt3FuxTCU7hxCInQd3FHGJT/IIrvm07oDO2Y8J DXWHGJ9cK49bBGmK9B4ajsbe5GxtSKFccu8BciNluF+BqbrIiM0upJq5Xqj4y+Xjrpwqm4/M ScBsV0Po7qdeqv0pEFIXKj7IgO/d4W2bABEBAAGJA3IEGAEKACYWIQSpQNQ0mSwujpkQPVAi T6fnzIKmZAUCWxk1DAIbAgUJA8JnAAFACRAiT6fnzIKmZMB0IAQZAQoAHRYhBKZ2GgCcqNxn k0Sx9r6Fd25170XjBQJbGTUMAAoJEL6Fd25170XjDBUH/2jQ7a8g+FC2qBYxU/aCAVAVY0NE YuABL4LJ5+iWwmqUh0V9+lU88Cv4/G8fWwU+hBykSXhZXNQ5QJxyR7KWGy7LiPi7Cvovu+1c 9Z9HIDNd4u7bxGKMpn19U12ATUBHAlvphzluVvXsJ23ES/F1c59d7IrgOnxqIcXxr9dcaJ2K k9VP3TfrjP3g98OKtSsyH0xMu0MCeyewf1piXyukFRRMKIErfThhmNnLiDbaVy6biCLx408L Mo4cCvEvqGKgRwyckVyo3JuhqreFeIKBOE1iHvf3x4LU8cIHdjhDP9Wf6ws1XNqIvve7oV+w B56YWoalm1rq00yUbs2RoGcXmtX1JQ//aR/paSuLGLIb3ecPB88rvEXPsizrhYUzbe1TTkKc 4a4XwW4wdc6pRPVFMdd5idQOKdeBk7NdCZXNzoieFntyPpAq+DveK01xcBoXQ2UktIFIsXey uSNdLd5m5lf7/3f0BtaY//f9grm363NUb9KBsTSnv6Vx7Co0DWaxgC3MFSUhxzBzkJNty+2d 10jvtwOWzUN+74uXGRYSq5WefQWqqQNnx+IDb4h81NmpIY/X0PqZrapNockj3WHvpbeVFAJ0 9MRzYP3x8e5OuEuJfkNnAbwRGkDy98nXW6fKeemREjr8DWfXLKFWroJzkbAVmeIL0pjXATxr +tj5JC0uvMrrXefUhXTo0SNoTsuO/OsAKOcVsV/RHHTwCDR2e3W8mOlA3QbYXsscgjghbuLh J3oTRrOQa8tUXWqcd5A0+QPo5aaMHIK0UAthZsry5EmCY3BrbXUJlt+23E93hXQvfcsmfi0N rNh81eknLLWRYvMOsrbIqEHdZBT4FHHiGjnck6EYx/8F5BAZSodRVEAgXyC8IQJ+UVa02QM5 D2VL8zRXZ6+wARKjgSrW+duohn535rG/ypd0ctLoXS6dDrFokwTQ2xrJiLbHp9G+noNTHSan ExaRzyLbvmblh3AAznb68cWmM3WVkceWACUalsoTLKF1sGrrIBj5updkKkzbKOq5gcC5AQ0E Wxk1NQEIAJ9B+lKxYlnKL5IehF1XJfknqsjuiRzj5vnvVrtFcPlSFL12VVFVUC2tT0A1Iuo9 NAoZXEeuoPf1dLDyHErrWnDyn3SmDgb83eK5YS/K363RLEMOQKWcawPJGGVTIRZgUSgGusKL NuZqE5TCqQls0x/OPljufs4gk7E1GQEgE6M90Xbp0w/r0HB49BqjUzwByut7H2wAdiNAbJWZ F5GNUS2/2IbgOhOychHdqYpWTqyLgRpf+atqkmpIJwFRVhQUfwztuybgJLGJ6vmh/LyNMRr8 J++SqkpOFMwJA81kpjuGR7moSrUIGTbDGFfjxmskQV/W/c25Xc6KaCwXah3OJ40AEQEAAYkC PAQYAQoAJhYhBKlA1DSZLC6OmRA9UCJPp+fMgqZkBQJbGTU1AhsMBQkDwmcAAAoJECJPp+fM gqZkPN4P/Ra4NbETHRj5/fM1fjtngt4dKeX/6McUPDIRuc58B6FuCQxtk7sX3ELs+1+w3eSV rHI5cOFRSdgw/iKwwBix8D4Qq0cnympZ622KJL2wpTPRLlNaFLoe5PkoORAjVxLGplvQIlhg miljQ3R63ty3+MZfkSVsYITlVkYlHaSwP2t8g7yTVa+q8ZAx0NT9uGWc/1Sg8j/uoPGrctml hFNGBTYyPq6mGW9jqaQ8en3ZmmJyw3CHwxZ5FZQ5qc55xgshKiy8jEtxh+dgB9d8zE/S/UGI E99N/q+kEKSgSMQMJ/CYPHQJVTi4YHh1yq/qTkHRX+ortrF5VEeDJDv+SljNStIxUdroPD29 2ijoaMFTAU+uBtE14UP5F+LWdmRdEGS1Ah1NwooL27uAFllTDQxDhg/+LJ/TqB8ZuidOIy1B xVKRSg3I2m+DUTVqBy7Lixo73hnW69kSjtqCeamY/NSu6LNP+b0wAOKhwz9hBEwEHLp05+mj 5ZFJyfGsOiNUcMoO/17FO4EBxSDP3FDLllpuzlFD7SXkfJaMWYmXIlO0jLzdfwfcnDzBbPwO hBM8hvtsyq8lq8vJOxv6XD6xcTtj5Az8t2JjdUX6SF9hxJpwhBU0wrCoGDkWp4Bbv6jnF7zP Nzftr4l8RuJoywDIiJpdaNpSlXKpj/K6KrnyAI/joYc7 Message-ID: Date: Tue, 1 Oct 2019 23:54:14 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/1/19 10:31 PM, David Rientjes wrote: > On Tue, 1 Oct 2019, Vlastimil Babka wrote: > >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 4ae967bcf954..2c48146f3ee2 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, >> nmask = policy_nodemask(gfp, pol); >> if (!nmask || node_isset(hpage_node, *nmask)) { >> mpol_cond_put(pol); >> + /* >> + * First, try to allocate THP only on local node, but >> + * don't reclaim unnecessarily, just compact. >> + */ >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_THISNODE, order); >> + gfp | __GFP_THISNODE | __GFP_NORETRY, order); > > The page allocator has heuristics to determine when compaction should be > retried, reclaim should be retried, and the allocation itself should retry > for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER). Yes, and flags like __GFP_NORETRY and __GFP_RETRY_MAYFAIL affect these heuristics according to the caller's willingness to wait vs availability of some kind of fallback. > PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior > when reclaim itself is unlikely -- or disruptive enough -- in making that > amount of contiguous memory available. The __GFP_NORETRY check for costly orders is one of the implementation details of that poor allocator behavior avoidance. > Rather than papering over the poor feedback loop between compaction and > reclaim that exists in the page allocator, it's probably better to improve > that and determine when an allocation should fail or it's worthwhile to > retry. That's a separate topic from NUMA locality of thp. I don't disagree. But we are doing that 'papering over' already and I simply believe it can be done better, until we can drop it. Right now at least the hugetlb allocations are regressing, as Michal showed. > In other words, we should likely address how compaction and reclaim is > done for all high order-allocations in the page allocator itself rather > than only here for hugepage allocations and relying on specific gfp bits > to be set. Well, with your patches, decisions are made based on pageblock order (affecting also hugetlbfs) and a specific __GFP_IO bit. I don't see how using a __GFP_NORETRY and costly order is worse - both are concepts already used to guide reclaim/compaction decisions. And it won't affect hugetlbfs. Also with some THP defrag modes, __GFP_NORETRY is also already being used. > Ask: if the allocation here should not retry regardless of why > compaction failed, why should any high-order allocation (user or kernel) > retry if compaction failed and at what order we should just fail? If that's refering to the quoted code above, the allocation here should not retry because it would lead to reclaiming just the local node, effectively a node reclaim mode, leading to the problems Andrea and Mel reported. That's because __GFP_THISNODE makes it rather specific, so we shouldn't conclude anything for the other kinds of allocations, as you're asking. > If > hugetlb wants to stress this to the fullest extent possible, it already > appropriately uses __GFP_RETRY_MAYFAIL. Which doesn't work anymore right now, and should again after this patch. > The code here is saying we care more about NUMA locality than hugepages That's why there's __GFP_THISNODE, yes. > simply because that's where the access latency is better and is specific > to hugepages; allocation behavior for high-order pages needs to live in > the page allocator. I'd rather add the __GFP_NORETRY to __GFP_THISNODE here to influence the allocation behavior, than reintroduce any __GFP_THISNODE checks in the page allocator. There are not that many __GFP_THISNODE users, but there are plenty of __GFP_NORETRY users, so it seems better to adjust behavior based on the latter flag. IIRC in the recent past you argued for putting back __GFP_NORETRY to all THP allocations including madvise, so why is there a problem now? >> >> /* >> - * If hugepage allocations are configured to always >> - * synchronous compact or the vma has been madvised >> - * to prefer hugepage backing, retry allowing remote >> - * memory as well. >> + * If that fails, allow both compaction and reclaim, >> + * but on all nodes. >> */ >> - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) >> + if (!page) >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_NORETRY, order); Also this __GFP_NORETRY was added by your patch, so I really don't see why do you think it's wrong in the first allocation attempt. >> + gfp, order); >> >> goto out; >> } > > We certainly don't want this for non-MADV_HUGEPAGE regions when thp > enabled bit is not set to "always". We'd much rather fallback to local > native pages because of its improved access latency. This is allowing all > hugepage allocations to be remote even without MADV_HUGEPAGE which is not > even what Andrea needs: qemu uses MADV_HUGEPAGE. OTOH it's better to use a remote THP than a remote base page, in case the local node is below watermark. But that would probably require a larger surgery of the allocator. But what you're saying is keep the gfp & __GFP_DIRECT_RECLAIM condition. I still suggest to drop the __GFP_NORETRY as that goes against MADV_HUGEPAGE.