Date: Thu, 11 Feb 2021 13:27:02 +0200
From: Mike Rapoport
To: David Hildenbrand
Cc: Michal Hocko, Mike Rapoport, Andrew Morton, Alexander Viro,
 Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
 Christopher Lameter, Dan Williams, Dave Hansen, Elena Reshetova,
 "H. Peter Anvin", Ingo Molnar, James Bottomley, "Kirill A. Shutemov",
 Matthew Wilcox, Mark Rutland, Michael Kerrisk, Palmer Dabbelt,
 Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
 Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon,
 linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org,
 linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer,
 Palmer Dabbelt
Subject: Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
Message-ID: <20210211112702.GI242749@kernel.org>
References: <20210208084920.2884-1-rppt@kernel.org>
 <20210208084920.2884-8-rppt@kernel.org>
 <20210208212605.GX242749@kernel.org>
 <20210209090938.GP299309@linux.ibm.com>
 <20210211071319.GF242749@kernel.org>
 <0d66baec-1898-987b-7eaf-68a015c027ff@redhat.com>
In-Reply-To: <0d66baec-1898-987b-7eaf-68a015c027ff@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
> On 11.02.21 09:39, Michal Hocko wrote:
> > On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
> > > On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> > > > On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> > [...]
> > > > > Citing my older email:
> > > > >
> > > > >     I've hesitated whether to continue to use new flags to memfd_create() or to
> > > > >     add a new system call and I've decided to use a new system call after I've
> > > > >     started to look into man pages update. There would have been two completely
> > > > >     independent descriptions and I think it would have been very confusing.
> > > >
> > > > Could you elaborate?
> > > > Unmapping from the kernel address space can work
> > > > both for sealed or hugetlb memfds, no? Those features are completely
> > > > orthogonal AFAICS. With a dedicated syscall you will need to introduce
> > > > this functionality on top if that is required. Have you considered that?
> > > > I mean hugetlb pages are used to back guest memory very often. Is this
> > > > something that will be a secret memory usecase?
> > > >
> > > > Please be really specific when giving arguments to back a new syscall
> > > > decision.
> > >
> > > Isn't "syscalls have completely independent description" specific enough?
> >
> > No, it's not as you can see from questions I've had above. More on that
> > below.
> >
> > > We are talking about API here, not the implementation details whether
> > > secretmem supports large pages or not.
> > >
> > > The purpose of memfd_create() is to create a file-like access to memory.
> > > The purpose of memfd_secret() is to create a way to access memory hidden
> > > from the kernel.
> > >
> > > I don't think overloading memfd_create() with the secretmem flags because
> > > they happen to return a file descriptor will be better for users, but
> > > rather will be more confusing.
> >
> > This is quite a subjective conclusion. I could very well argue that it
> > would be much better to have a single syscall to get a fd backed memory
> > with specific requirements (sealing, unmapping from the kernel address
> > space). Neither of us would be clearly right or wrong. A more important
> > point is a future extensibility and usability, though. So let's just
> > think of few usecases I have outlined above. Is it unrealistic to expect
> > that secret memory should be sealable? What about hugetlb? Because if
> > the answer is no then a new API is a clear win as the combination of
> > flags would never work and then we would just suffer from the syscall
> > multiplexing without much gain.
> > On the other hand if combination of the
> > functionality is to be expected then you will have to jam it into
> > memfd_create and copy the interface likely causing more confusion. See
> > what I mean?
> >
> > I by no means do not insist one way or the other but from what I have
> > seen so far I have a feeling that the interface hasn't been thought
> > through enough. Sure you have landed with fd based approach and that
> > seems fair. But how to get that fd seems to still have some gaps IMHO.
> >
>
> I agree with Michal. This has been raised by different
> people already, including on LWN (https://lwn.net/Articles/835342/).
>
> I can follow Mike's reasoning (man page), and I am also fine if there is
> a valid reason. However, IMHO the basic description seems to match quite good:
>
>        memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The
>        file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
>        However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all
>        references to the file are dropped, it is automatically released. Anonymous memory is used
>        for all backing pages of the file. Therefore, files created by memfd_create() have the same
>        semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
>        MAP_ANONYMOUS flag.

Even despite my laziness and huge amount of copy-paste you can spot the
differences (this is a very old version, update is due):

       memfd_secret() creates an anonymous file and returns a file descriptor
       that refers to it. The file can only be memory-mapped; the memory in
       such mapping will have stronger protection than usual memory mapped
       files, and so it can be used to store application secrets. Unlike a
       regular file, a file created with memfd_secret() lives in RAM and has a
       volatile backing storage. Once all references to the file are dropped,
       it is automatically released.
       The initial size of the file is set to 0. Following the call, the file
       size should be set using ftruncate(2).

       The memory areas obtained with mmap(2) from the file descriptor are
       exclusive to the owning context. These areas are removed from the kernel
       page tables and only the page table of the process holding the file
       descriptor maps the corresponding physical memory.

> AFAIKS, we would need MFD_SECRET and disallow
> MFD_ALLOW_SEALING and MFD_HUGETLB.

So here we start to multiplex.

> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
> temporary mappings (eor migration). TBC.

Never map is the default. When we'll need to map we'll add an explicit flag
for it.

-- 
Sincerely yours,
Mike.
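
To make the multiplexing concern concrete, here is a minimal sketch of the
flag check a combined memfd_create() would need if it grew an MFD_SECRET flag
that cannot be combined with MFD_ALLOW_SEALING or MFD_HUGETLB, as suggested in
the quoted proposal. MFD_SECRET, its value, and the helper check_memfd_flags()
are hypothetical and exist only for this illustration; the other three values
are the ones from <linux/memfd.h>. The sketch models the check in Python, makes
no system calls, and ignores the MFD_HUGE_* page-size encoding.

```python
import errno

# Existing memfd_create() flags, values as in <linux/memfd.h>.
MFD_CLOEXEC = 0x0001
MFD_ALLOW_SEALING = 0x0002
MFD_HUGETLB = 0x0004

# Hypothetical flag: no MFD_SECRET exists upstream, and this value is
# invented purely for the illustration.
MFD_SECRET = 0x0100

KNOWN_FLAGS = MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET
# Flags the quoted proposal says must be refused together with MFD_SECRET.
SECRET_EXCLUDES = MFD_ALLOW_SEALING | MFD_HUGETLB


def check_memfd_flags(flags):
    """Return 0 if the flag combination is accepted, -EINVAL otherwise."""
    # Reject any bit this model does not know about.
    if flags & ~KNOWN_FLAGS:
        return -errno.EINVAL
    # The multiplexing step: one flag changes which other flags are legal.
    if (flags & MFD_SECRET) and (flags & SECRET_EXCLUDES):
        return -errno.EINVAL
    return 0
```

With a separate memfd_secret() system call the second check is not needed in
memfd_create(), at the price of a second entry point; which of the two is
preferable is what this thread is debating.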