Date: Tue, 2 Feb 2021 14:27:14 +0100
From: Michal Hocko
To: Mike Rapoport
Cc: James Bottomley, David Hildenbrand, Andrew Morton, Alexander Viro,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dan Williams, Dave Hansen, Elena Reshetova,
	"H. Peter Anvin", Ingo Molnar, "Kirill A. Shutemov", Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer,
	Palmer Dabbelt
Subject: Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
References: <20210126114657.GL827@dhcp22.suse.cz>
	<303f348d-e494-e386-d1f5-14505b5da254@redhat.com>
	<20210126120823.GM827@dhcp22.suse.cz>
	<20210128092259.GB242749@kernel.org>
	<73738cda43236b5ac2714e228af362b67a712f5d.camel@linux.ibm.com>
	<6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com>
	<20210202124857.GN242749@kernel.org>
In-Reply-To: <20210202124857.GN242749@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 02-02-21 14:48:57, Mike Rapoport wrote:
> On Tue, Feb 02, 2021 at 10:35:05AM +0100, Michal Hocko wrote:
> > On Mon 01-02-21 08:56:19, James Bottomley wrote:
> >
> > I have also proposed potential ways out of this. Either the pool is not
> > fixed sized and you make it a regular unevictable memory (if direct map
> > fragmentation is not considered a major problem)
>
> I think that the direct map fragmentation is not a major problem, and the
> data we have confirms it, so I'd be more than happy to entirely drop the
> pool, allocate memory page by page and remove each page from the direct
> map.
>
> Still, we cannot prove negative and it could happen that there is a
> workload that would suffer a lot from the direct map fragmentation, so
> having a pool of large pages upfront is better than trying to fix it
> afterwards. As we get more confidence that the direct map fragmentation is
> not an issue as it is common to believe we may remove the pool altogether.

I would drop the pool altogether, instantiate pages on the unevictable
LRU list and internally treat them like ramdisk/mlock memory so that you
get the accounting right. The feature should still be opt-in (e.g. a
kernel command line parameter) for now. The recent report by Intel
(http://lkml.kernel.org/r/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com)
shows that there is no clear win in having huge mappings in _general_,
but there are still workloads which benefit.

> I think that using PMD_ORDER allocations for the pool with a fallback to
> order 0 will do the job, but unfortunately I doubt we'll reach a consensus
> about this because dogmatic beliefs are hard to shake...

If this is opt-in then those beliefs can be relaxed somewhat. Long term
it makes a lot of sense to optimize for better direct map management,
but I do not think this is a hard requirement for an initial
implementation if it is not imposed on everybody by default.

> A more restrictive possibility is to still use plain PMD_ORDER allocations
> to fill the pool, without relying on CMA. In this case there will be no
> global secretmem specific pool to exhaust, but then it's possible to drain
> high order free blocks in a system, so CMA has an advantage of limiting
> secretmem pools to certain amount of memory with somewhat higher
> probability for high order allocation to succeed.
>
> > or you need a careful access control
>
> Do you mind elaborating what do you mean by "careful access control"?

As already mentioned, a mechanism to control who can use this feature - e.g.
make it a special device to which access can be controlled by permissions
or higher level security policies. But that is really needed only if the
pool is fixed sized.

> > or you need SIGBUS on the mmap failure (to allow at least some fallback
> > mode to caller).
>
> As I've already said, I agree that SIGBUS is way better than OOM at #PF
> time.

It would be better than OOM but it would still be a terrible interface.
So I would go down that path only as a last resort. I do not even want to
think about what kind of security consequences that would have. E.g. think
of somebody depleting the pool and pushing a security sensitive workload
into a fallback which is not backed by security memory.

> And we can add some means to fail at mmap() time if the pools are running
> low.

Welcome to hugetlb reservation world...
-- 
Michal Hocko
SUSE Labs