Date: Fri, 2 Oct 2020 09:32:05 +0200
From: Michal Hocko
To: Zi Yan
Cc: linux-mm@kvack.org, "Kirill A. Shutemov", Roman Gushchin, Rik van Riel,
 Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe, Mike Kravetz,
 David Hildenbrand, William Kucharski, Andrea Arcangeli, John Hubbard,
 David Nellans, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
Message-ID: <20201002073205.GC20872@dhcp22.suse.cz>
References: <20200928175428.4110504-1-zi.yan@sent.com>
 <20200930115505.GT2277@dhcp22.suse.cz>
 <73394A41-16D8-431C-9E48-B14D44F045F8@nvidia.com>
In-Reply-To: <73394A41-16D8-431C-9E48-B14D44F045F8@nvidia.com>
Shutemov" , Roman Gushchin , Rik van Riel , Matthew Wilcox , Shakeel Butt , Yang Shi , Jason Gunthorpe , Mike Kravetz , David Hildenbrand , William Kucharski , Andrea Arcangeli , John Hubbard , David Nellans , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Message-ID: <20201002073205.GC20872@dhcp22.suse.cz> References: <20200928175428.4110504-1-zi.yan@sent.com> <20200930115505.GT2277@dhcp22.suse.cz> <73394A41-16D8-431C-9E48-B14D44F045F8@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <73394A41-16D8-431C-9E48-B14D44F045F8@nvidia.com> User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 01-10-20 11:14:14, Zi Yan wrote: > On 30 Sep 2020, at 7:55, Michal Hocko wrote: > > > On Mon 28-09-20 13:53:58, Zi Yan wrote: > >> From: Zi Yan > >> > >> Hi all, > >> > >> This patchset adds support for 1GB PUD THP on x86_64. It is on top of > >> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at: > >> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23 > >> > >> Other than PUD THP, we had some discussion on generating THPs and contiguous > >> physical memory via a synchronous system call [0]. I am planning to send out a > >> separate patchset on it later, since I feel that it can be done independently of > >> PUD THP support. > > > > While the technical challenges for the kernel implementation can be > > discussed before the user API is decided I believe we cannot simply add > > something now and then decide about a proper interface. I have raised > > few basic questions we should should find answers for before the any > > interface is added. Let me copy them here for easier reference > Sure. Thank you for doing this. > > For this new interface, I think it should generate THPs out of populated > memory regions synchronously. It would be complement to khugepaged, which > generate THPs asynchronously on the background. > > > - THP allocation time - #PF and/or madvise context > I am not sure this is relevant, since the new interface is supposed to > operate on populated memory regions. For THP allocation, madvise and > the options from /sys/kernel/mm/transparent_hugepage/defrag should give > enough choices to users. OK, so no #PF, this makes things easier. > > - lazy/sync instantiation > > I would say the new interface only does sync instantiation. madvise has > provided the lazy instantiation option by adding MADV_HUGEPAGE to populated > memory regions and letting khugepaged generate THPs from them. OK > > - huge page sizes controllable by the userspace? > > It might be good to allow advanced users to choose the page sizes, so they > have better control of their applications. Could you elaborate more? Those advanced users can use hugetlb, right? They get a very good control over page size and pool preallocation etc. So they can get what they need - assuming there is enough memory. > For normal users, we can provide > best-effort service. Different options can be provided for these two cases. Do we really need two sync mechanisms to compact physical memory? This adds an API complexity because it has to cover all possible huge pages and that can be a large set of sizes. We already have that choice for hugetlb mmap interface but that is needed to cover all existing setups. I would argue this doesn't make the API particurarly easy to use. 
> The new interface might want to inform the user how many THPs were generated
> after the call, for them to decide what to do with the memory region.

Why would that be useful? /proc/<pid>/smaps should give a good picture
already, right?

> > - aggressiveness - how hard to try
>
> The new interface would try as hard as it can, since I assume users really
> want THPs when they use this interface.
>
> > - internal fragmentation - allow to create THPs on sparsely or unpopulated
> > ranges
>
> The new interface would only operate on populated memory regions. A
> MAP_POPULATE-like option can be added if necessary.

OK, so initially you do not want to populate more memory. How do you
envision a future extension to provide such functionality? A different
API, or a modification to the existing one?

> > - do we need some sort of access control or privilege check as some THPs
> > would be really scarce (like those that require pre-reservation)?
>
> It seems too much to me. I suppose if we provide page size options to users
> when generating THPs, user apps could coordinate among themselves. BTW, do we
> have access control for hugetlb pages? If yes, we could borrow their method.

We do not. Well, there is a hugetlb cgroup controller, but I am not sure
this is the right method. The lack of hugetlb access control is a serious
shortcoming which has turned this interface into an "only first class
citizens" feature that requires very close coordination with an admin.

-- 
Michal Hocko
SUSE Labs
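To illustrate the /proc/<pid>/smaps point above - a minimal sketch, not
taken from this thread, assuming the AnonHugePages: field format of
current procfs - a process can total the THP-backed part of its own
anonymous mappings like this:

/*
 * Illustrative sketch only: sum the "AnonHugePages:" fields in
 * /proc/self/smaps to see how much of the process's anonymous memory is
 * currently backed by THPs.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long kb, total_kb = 0;

	if (!f) {
		perror("fopen /proc/self/smaps");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
			total_kb += kb;	/* per-VMA THP-backed amount */
	}
	fclose(f);

	printf("AnonHugePages total: %lu kB\n", total_kb);
	return 0;
}

An external tool can read another process's smaps the same way, which is
the existing picture the reply refers to.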