Date: Thu, 3 Oct 2019 07:27:00 +0200
From: Michal Hocko
To: David Rientjes
Cc: Mike Kravetz, Vlastimil Babka, Linus Torvalds, Andrea Arcangeli,
    Andrew Morton, Mel Gorman, "Kirill A. Shutemov",
    Linux Kernel Mailing List, Linux-MM
Subject: Re: [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim
Message-ID: <20191003052700.GB24174@dhcp22.suse.cz>

On Wed 02-10-19 16:03:03, David Rientjes wrote:
> Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get
> hugepages that the user needs.
> Commit b39d0ee2632d ("mm, page_alloc:
> avoid expensive reclaim when compaction may not succeed") intends to
> improve allocator behavior for thp allocations to prevent excessive amounts
> of reclaim especially when constrained to a single node.
>
> Since hugetlb allocations have explicitly preferred to loop and do reclaim
> and compaction, exempt them from this new behavior at least for the time
> being. It is not shown that hugetlb allocation success rate has been
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly
> beyond the scope of what the patch is intended to address (thp
> allocations).

It has become pretty clear that b39d0ee2632d has regressed the hugetlb
allocation success rate for any non-trivial case (anything but
completely free memory), see
http://lkml.kernel.org/r/20191001054343.GA15624@dhcp22.suse.cz
And this is not just about hugetlb requests, really. They are likely
the most obvious example, but __GFP_RETRY_MAYFAIL in general is supposed
to try as hard as feasible to make the allocation succeed. The decision
to bail out is made at a different spot and b39d0ee2632d is effectively
bypassing that logic.

Now to the patch itself. I didn't get to test it on my testing workload,
but the steps are clearly documented and easy to set up and reproduce.
I am at a training today and unfortunately unlikely to get to test by
the end of the week. Anyway, the patch should fix the problem because it
explicitly opts __GFP_RETRY_MAYFAIL requests out of the early bail-out.

I am pretty sure we will need more follow-ups because the bail-out logic
is simply behaving quite randomly, as my measurements show (I would
really appreciate feedback there). We need a more systematic solution
because the current logic has been rushed through without a proper
analysis and without any actual workloads to verify the effect.
> Cc: Mike Kravetz

Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")

> Signed-off-by: David Rientjes

I am willing to give my ack, considering that this is a clear regression
and this is probably the simplest fix, but the changelog should be
explicit about the effect (feel free to borrow my numbers and
explanation in this thread).

> ---
> Mike, you alluded that you may want to opt hugetlbfs out of this for the
> time being in https://marc.info/?l=linux-kernel&m=156771690024533 --
> not sure if you want to allow this excessive amount of reclaim for
> hugetlb allocations or not given the swap storms Andrea has shown is
> possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
> part of the problem we are trying to address here so no objection to
> opting it out.
>
> You might want to consider how expensive hugetlb allocations can become
> and disruptive to the system if it does not yield additional hugepages,
> but that can be done at any time later as a general improvement rather
> than part of a series aimed at thp.
>
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
> +		    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
>  			/*
>  			 * If allocating entire pageblock(s) and compaction
>  			 * failed because all zones are below low watermarks
>  			 * or is prohibited because it recently failed at this
> -			 * order, fail immediately.
> +			 * order, fail immediately unless the allocator has
> +			 * requested compaction and reclaim retry.
>  			 *
>  			 * Reclaim is
>  			 *  - potentially very expensive because zones are far

-- 
Michal Hocko
SUSE Labs
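
The condition changed by the hunk above can be modelled in user space to
see which requests lose the early bail-out. This is only a sketch of the
outer `if` test visible in the diff. The SKETCH_* flag bits and the
pageblock order of 9 are made-up stand-ins, not the kernel's real values,
and the checks inside the block are not modelled.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; the real definitions
 * live in the kernel's gfp headers and have different values. */
#define SKETCH_GFP_IO            0x1u
#define SKETCH_GFP_RETRY_MAYFAIL 0x2u

/* Assumed pageblock order; the kernel's pageblock_order depends on
 * architecture and configuration. */
#define SKETCH_PAGEBLOCK_ORDER   9u

/* Outer test before the patch: every pageblock-sized (or larger)
 * request that allows IO may take the early-failure path. */
static bool early_bail_candidate_old(unsigned int gfp_mask, unsigned int order)
{
	return order >= SKETCH_PAGEBLOCK_ORDER && (gfp_mask & SKETCH_GFP_IO);
}

/* Outer test after the patch: the same, except that requests carrying
 * the retry-may-fail flag are excluded from the early-failure path. */
static bool early_bail_candidate_new(unsigned int gfp_mask, unsigned int order)
{
	return order >= SKETCH_PAGEBLOCK_ORDER && (gfp_mask & SKETCH_GFP_IO) &&
	       !(gfp_mask & SKETCH_GFP_RETRY_MAYFAIL);
}
```

Under these assumptions, an order-9 request with both SKETCH_GFP_IO and
SKETCH_GFP_RETRY_MAYFAIL set is a bail-out candidate for the old test but
not for the new one, while a request without the retry flag is treated
the same by both.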