Date: Mon, 15 Feb 2021 14:19:28 +0100
From: Michal Hocko
To: Muchun Song
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
    bp@alien8.de, x86@kernel.org, hpa@zytor.com,
    dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
    viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
    Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
    jroedel@suse.de, Mina Almasry, David Rientjes, Matthew Wilcox,
    Oscar Salvador, "Song Bao Hua (Barry Song)", David Hildenbrand,
    HORIGUCHI NAOYA(堀口 直也), Joao Martins, Xiongchun duan,
    linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
    linux-fsdevel
Subject: Re: [External] Re: [PATCH v15 4/8] mm: hugetlb: alloc the vmemmap
 pages associated with each HugeTLB page
References: <20210208085013.89436-1-songmuchun@bytedance.com>
 <20210208085013.89436-5-songmuchun@bytedance.com>

On Mon 15-02-21 20:44:57, Muchun Song wrote:
> On Mon, Feb 15, 2021 at 8:18 PM Michal Hocko wrote:
> >
> > On Mon 15-02-21 20:00:07, Muchun Song wrote:
> > > On Mon, Feb 15, 2021 at 7:51 PM Muchun Song wrote:
> > > >
> > > > On Mon, Feb 15, 2021 at 6:33 PM Michal Hocko wrote:
> > > > >
> > > > > On Mon 15-02-21 18:05:06, Muchun Song wrote:
> > > > > > On Fri, Feb 12, 2021 at 11:32 PM Michal Hocko wrote:
> > > > > [...]
> > > > > > > > +int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> > > > > > > > +{
> > > > > > > > +	int ret;
> > > > > > > > +	unsigned long vmemmap_addr = (unsigned long)head;
> > > > > > > > +	unsigned long vmemmap_end, vmemmap_reuse;
> > > > > > > > +
> > > > > > > > +	if (!free_vmemmap_pages_per_hpage(h))
> > > > > > > > +		return 0;
> > > > > > > > +
> > > > > > > > +	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> > > > > > > > +	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> > > > > > > > +	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
> > > > > > > > +	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
> > > > > > > > +	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
> > > > > > > > +	 * When a HugeTLB page is freed to the buddy allocator, previously
> > > > > > > > +	 * discarded vmemmap pages must be allocated and remapped.
> > > > > > > > +	 */
> > > > > > > > +	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
> > > > > > > > +				  GFP_ATOMIC | __GFP_NOWARN | __GFP_THISNODE);
> > > > > > >
> > > > > > > I do not think that this is a good allocation mode. GFP_ATOMIC is a
> > > > > > > non-sleeping allocation and a medium memory pressure might cause it to
> > > > > > > fail prematurely. I do not think this is really an atomic context which
> > > > > > > couldn't afford memory reclaim. I also do not think we want to grant
> > > > > >
> > > > > > Because alloc_huge_page_vmemmap is called under hugetlb_lock now,
> > > > > > using GFP_ATOMIC indeed makes the code simpler.
> > > > >
> > > > > You can have a preallocated list of pages prior to taking the lock.
> > > >
> > > > A discussion about this can be found here:
> > > >
> > > > https://patchwork.kernel.org/project/linux-mm/patch/20210117151053.24600-5-songmuchun@bytedance.com/
> > > >
> > > > > Moreover, do we want to manipulate vmemmaps from under a spinlock in
> > > > > general? I have to say I missed that detail when reviewing. Need to
> > > > > think more.
> > > > >
> > > > > > From the kernel documentation, I learned that __GFP_NOMEMALLOC
> > > > > > can be used to explicitly forbid access to emergency reserves. So if
> > > > > > we do not want to use the reserve memory, how about replacing it with
> > > > > >
> > > > > > GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE
> > > > >
> > > > > The whole point of GFP_ATOMIC is to grant access to memory reserves, so
> > > > > the above is quite dubious. If you do not want access to memory reserves
> > > >
> > > > Look at the code of gfp_to_alloc_flags():
> > > >
> > > > static inline unsigned int gfp_to_alloc_flags(gfp_t gfp_mask)
> > > > {
> > > > [...]
> > > > 	if (gfp_mask & __GFP_ATOMIC) {
> > > > 		/*
> > > > 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
> > > > 		 * if it can't schedule.
> > > > 		 */
> > > > 		if (!(gfp_mask & __GFP_NOMEMALLOC))
> > > > 			alloc_flags |= ALLOC_HARDER;
> > > > [...]
> > > > }
> > > >
> > > > It seems to allow this combination (GFP_ATOMIC | __GFP_NOMEMALLOC).
> >
> > Please read my response again more carefully. I am not claiming that
> > the combination is not allowed. I have said it doesn't make any sense
> > in this context.
>
> I see you are worried that using GFP_ATOMIC will use the memory
> reserves without limit, so I think that __GFP_NOMEMALLOC may be
> suitable for us. Sorry, I may not have understood the point you made.
> What did I miss?

OK, let me try to explain again. GFP_ATOMIC is not only a non-sleeping
allocation request. It also grants access to memory reserves. The
latter is a bit more involved because there are more layers of memory
reserves to access, but that is not really important. The non-sleeping
semantic can be achieved by GFP_NOWAIT, which will not grant access to
reserves unless explicitly stated - e.g. by __GFP_HIGH or __GFP_ATOMIC.

Is that more clear?

Now again, why I do not think access to memory reserves is suitable.
Hugetlb pages can be released in large batches, and that might cause a
peak depletion of memory reserves which are normally used by other
consumers as well. Other GFP_ATOMIC users might see allocation
failures. Those shouldn't really be fatal, as nobody should be relying
on them, and a failure usually means a handover to a different, less
constrained, context. So this concern is more about well-behaved
behavior on the hugetlb side than about correctness.

Is that more clear?

There shouldn't be any real reason why the memory allocation for
vmemmaps, or handling vmemmap in general, has to be done from within
the hugetlb lock and therefore require a non-sleeping semantic. All of
that can be deferred to a more relaxed context. If you want to make a
GFP_NOWAIT optimistic attempt in the direct free path, then no problem,
but you have to expect failures under memory pressure. If you want a
more robust allocation request, then you have to go outside of the
spinlock and use GFP_KERNEL | __GFP_NORETRY or
GFP_KERNEL | __GFP_RETRY_MAYFAIL, depending on how hard you want to
try. __GFP_THISNODE makes a slight difference here, but it is something
I would recommend not depending on.

Is that more clear?

-- 
Michal Hocko
SUSE Labs
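
As a concrete illustration of the two-stage approach described above,
here is a minimal sketch. It assumes a hypothetical
hugetlb_vmemmap_restore() helper that takes the gfp mask as a parameter
(the alloc_huge_page_vmemmap() in the patch hardcodes the mask); the
function names and the split between the two call sites are illustrative
only, not the actual implementation:

#include <linux/gfp.h>
#include <linux/hugetlb.h>

/*
 * Hypothetical variant of alloc_huge_page_vmemmap() from the patch,
 * assumed to take the gfp mask as an argument.
 */
int hugetlb_vmemmap_restore(struct hstate *h, struct page *head, gfp_t gfp);

/*
 * Stage 1: optimistic attempt in the direct free path, under
 * hugetlb_lock. GFP_NOWAIT is non-sleeping like GFP_ATOMIC but does
 * not grant access to memory reserves, so a large batch of hugetlb
 * frees cannot deplete the reserves that other GFP_ATOMIC users rely
 * on. Failure is expected under memory pressure; the caller must
 * handle it, e.g. by queueing the page for stage 2.
 */
static int restore_vmemmap_optimistic(struct hstate *h, struct page *head)
{
	return hugetlb_vmemmap_restore(h, head,
				       GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE);
}

/*
 * Stage 2: fallback after dropping hugetlb_lock (e.g. from a
 * workqueue). Sleeping is allowed here, so the allocation can enter
 * reclaim. __GFP_RETRY_MAYFAIL tries hard but still returns failure
 * rather than invoking the OOM killer; __GFP_NORETRY would be the
 * lighter-weight alternative.
 */
static int restore_vmemmap_deferred(struct hstate *h, struct page *head)
{
	return hugetlb_vmemmap_restore(h, head,
				       GFP_KERNEL | __GFP_RETRY_MAYFAIL);
}

Note that __GFP_THISNODE is kept only in the optimistic path in this
sketch; consistent with the remark above, node-local placement is
treated as a nice-to-have rather than something to depend on.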