Subject: Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed
To: Michal Hocko, David Rientjes
Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Mel Gorman,
 "Kirill A. Shutemov", linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Mike Kravetz
References: <20190905090009.GF3838@dhcp22.suse.cz>
In-Reply-To: <20190905090009.GF3838@dhcp22.suse.cz>
From: Vlastimil Babka
Date: Thu, 5 Sep 2019 13:22:39 +0200
X-Mailing-List: linux-kernel@vger.kernel.org

On 9/5/19 11:00 AM, Michal Hocko wrote:
> [Ccing Mike for checking on the hugetlb side of this change]
>
> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>> Memory compaction has a couple significant drawbacks as the allocation
>> order increases, specifically:
>>
>>  - isolate_freepages() is responsible for finding free pages to use as
>>    migration targets and is implemented as a linear scan of memory
>>    starting at the end of a zone,

Note that's no longer entirely true, see fast_isolate_freepages().

>>  - failing order-0 watermark checks in memory compaction does not account
>>    for how far below the watermarks the zone actually is: to enable
>>    migration, there must be *some* free memory available.
>>    Per the above, watermarks are not always sufficient if
>>    isolate_freepages() cannot find the free memory but it could require
>>    hundreds of MBs of reclaim to even reach this threshold (read:
>>    potentially very expensive reclaim with no indication compaction can
>>    be successful), and

I doubt it's hundreds of MBs for a 2MB hugepage.

>>  - if compaction at this order has failed recently so that it does not even
>>    run as a result of deferred compaction, looping through reclaim can often
>>    be pointless.

Agreed.

>> For hugepage allocations, these are quite substantial drawbacks because
>> these are very high order allocations (order-9 on x86) and falling back to
>> doing reclaim can potentially be *very* expensive without any indication
>> that compaction would even be successful.

You seem to lump together hugetlbfs and THP here, by saying "hugepage",
but these are very different things - hugetlbfs reservations are
expected to be potentially expensive.

>> Reclaim itself is unlikely to free entire pageblocks and certainly no
>> reliance should be put on it to do so in isolation (recall lumpy reclaim).
>> This means we should avoid reclaim and simply fail hugepage allocation if
>> compaction is deferred.

It is however possible that reclaim frees enough to make even a
previously deferred compaction succeed.

>> It is also not helpful to thrash a zone by doing excessive reclaim if
>> compaction may not be able to access that memory.  If order-0 watermarks
>> fail and the allocation order is sufficiently large, it is likely better
>> to fail the allocation rather than thrashing the zone.
>>
>> Signed-off-by: David Rientjes
>> ---
>>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  		if (page)
>>  			goto got_pg;
>>
>> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>> +			/*
>> +			 * If allocating entire pageblock(s) and compaction
>> +			 * failed because all zones are below low watermarks
>> +			 * or is prohibited because it recently failed at this
>> +			 * order, fail immediately.
>> +			 *
>> +			 * Reclaim is
>> +			 *  - potentially very expensive because zones are far
>> +			 *    below their low watermarks or this is part of very
>> +			 *    bursty high order allocations,
>> +			 *  - not guaranteed to help because isolate_freepages()
>> +			 *    may not iterate over freed pages as part of its
>> +			 *    linear scan, and
>> +			 *  - unlikely to make entire pageblocks free on its
>> +			 *    own.
>> +			 */
>> +			if (compact_result == COMPACT_SKIPPED ||
>> +			    compact_result == COMPACT_DEFERRED)
>> +				goto nopage;

As I said, I expect this will make hugetlbfs reservations fail
prematurely - Mike can probably confirm or disprove that.

I think it also addresses consequences, not the primary problem, IMHO.
I believe the primary problem is that we reclaim something even if
there's enough memory for compaction. This won't change with your
patch, as compact_result won't be SKIPPED in that case.

Then we continue through to __alloc_pages_direct_reclaim(),
shrink_zones() which will call compaction_ready(), which will only
return true and skip reclaim of the zone, if there's high_watermark
(!!!) + compact_gap() pages. But as long as one zone isn't
compaction_ready(), we enter shrink_node(), which will reclaim
something and call should_continue_reclaim() where we might finally
notice that compaction_suitable() returns CONTINUE, and abort reclaim.

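To make the gap between the two thresholds concrete, here is a minimal
userspace sketch in C. It is not kernel code: struct model_zone,
model_compaction_can_run() and model_compaction_ready() are hypothetical
simplifications that compare raw page counts and ignore lowmem reserves,
the fragmentation index and the per-node vs per-zone split. The
watermark numbers are made up. Only compact_gap() follows the kernel's
formula (2UL << order), and the "low watermark + gap" test assumes a
costly-order request such as order-9.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct zone: page counts only. */
struct model_zone {
	unsigned long free_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
};

/* Same formula as the kernel's compact_gap(): twice the request size. */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Simplified assumption: compaction can start at low watermark + gap. */
static bool model_compaction_can_run(const struct model_zone *z,
				     unsigned int order)
{
	return z->free_pages >= z->low_wmark + compact_gap(order);
}

/* Simplified: reclaim skips the zone only at high watermark + gap. */
static bool model_compaction_ready(const struct model_zone *z,
				   unsigned int order)
{
	return z->free_pages >= z->high_wmark + compact_gap(order);
}

/* A zone sitting between the two thresholds, with invented numbers. */
static void model_selfcheck(void)
{
	struct model_zone z = {
		.free_pages = 6000, .low_wmark = 4000, .high_wmark = 6000,
	};

	assert(compact_gap(9) == 1024);			/* 4MB with 4KB pages */
	assert(model_compaction_can_run(&z, 9));	/* 6000 >= 5024 */
	assert(!model_compaction_ready(&z, 9));		/* 6000 <  7024 */
}
```

In this made-up zone compaction has enough free memory to run, yet the
zone does not count as ready, so reclaim would still be entered for it.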
Thus I think the right solution might be to really avoid reclaim for
zones where compaction is not skipped, while your patch avoids reclaim
when compaction is skipped. The per-node reclaim vs per-zone compaction
might complicate those decisions a lot, though.

>> +		}
>> +
>>  		/*
>>  		 * Checks for costly allocations with __GFP_NORETRY, which
>>  		 * includes THP page fault allocations
>
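The contrast between the quoted patch and the direction suggested in the
reply can be written down as two predicates. This is only a sketch of
one possible reading, not a proposed patch: enum model_result and both
function names are hypothetical, the results are passed in rather than
computed, and the per-node vs per-zone complication is left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's compact_result values. */
enum model_result {
	MODEL_SKIPPED,	/* too little free memory for compaction to start */
	MODEL_DEFERRED,	/* compaction failed recently at this order */
	MODEL_CONTINUE,	/* compaction is able to run */
};

/* The quoted patch: fail the allocation instead of reclaiming. */
static bool patch_fails_without_reclaim(unsigned int order,
					unsigned int pageblock_order,
					bool gfp_io, enum model_result r)
{
	return order >= pageblock_order && gfp_io &&
	       (r == MODEL_SKIPPED || r == MODEL_DEFERRED);
}

/* One reading of the reply: leave a zone alone if compaction can run. */
static bool reply_avoids_zone_reclaim(enum model_result zone_suitable)
{
	return zone_suitable != MODEL_SKIPPED;
}

/* The two conditions trigger in opposite situations. */
static void model_policy_selfcheck(void)
{
	assert(patch_fails_without_reclaim(9, 9, true, MODEL_SKIPPED));
	assert(!patch_fails_without_reclaim(9, 9, true, MODEL_CONTINUE));
	assert(!reply_avoids_zone_reclaim(MODEL_SKIPPED));
	assert(reply_avoids_zone_reclaim(MODEL_CONTINUE));
}
```

The patch acts when compaction could not start or was deferred, while
the reply's idea acts when compaction is able to run, which is the case
the patch leaves unchanged.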