Date: Wed, 30 Oct 2019 09:40:06 +0100
From: Mike Rapoport
To: Andy Lutomirski
Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann, Borislav Petkov,
    Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Ingo Molnar, "H.
    Peter Anvin", Linux API, Linux-MM, X86 ML, Mike Rapoport
Subject: Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
Message-ID: <20191030084005.GC20624@rapoport-lnx>
References: <1572171452-7958-1-git-send-email-rppt@kernel.org>
    <20191029093254.GE18773@rapoport-lnx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport wrote:
> >
> > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > >
> > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport wrote:
> > > >
> > > > From: Mike Rapoport
> > > >
> > > > Hi,
> > > >
> > > > The patch below aims to allow applications to create mappings that have
> > > > pages visible only to the owning process. Such mappings could be used to
> > > > store secrets so that these secrets are visible neither to other
> > > > processes nor to the kernel.
> > > >
> > > > I've only tested the basic functionality, the changes should be verified
> > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > >
> > > I’ve contemplated the concept a fair amount, and I think you should
> > > consider a change to the API. In particular, rather than having it be a
> > > MAP_ flag, make it a chardev. You can, at least at first, allow only
> > > MAP_SHARED, and admins can decide who gets to use it. It might also play
> > > better with the VM overall, and you won’t need a VM_ flag for it -- you
> > > can just wire up .fault to do the right thing.
> >
> > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > interface.
>
> Then you have a whole bunch of questions to answer.
> For example:
>
> What happens if you mprotect() or similar when the mapping is already
> in use in a way that's incompatible with MAP_EXCLUSIVE?

Then we refuse the mprotect()? As in any other case where vm_flags are not
compatible with the requested madvise()/mprotect() operation.

> Is it actually reasonable to malloc() some memory and then make it exclusive?
>
> Are you permitted to map a file MAP_EXCLUSIVE? What does it mean?

I'd limit MAP_EXCLUSIVE to anonymous memory only.

> What does MAP_PRIVATE | MAP_EXCLUSIVE do?

My preference is to have only mmap(), and then the semantics are clearer:

MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
and drops the pages in this region from the direct map.
The pages are returned to the direct map on munmap().
Then there is no way to change an existing area to be exclusive or vice
versa.

> How does one pass exclusive memory via SCM_RIGHTS? (If it's a
> memfd-like or chardev interface, it's trivial. mmap(), not so much.)

Why would passing such memory via SCM_RIGHTS be useful?

> And finally, there's my personal giant pet peeve: a major use of this
> will be for virtualization. I suspect that a lot of people would like
> the majority of KVM guest memory to be unmapped from the host
> pagetables. But people might also like for guest memory to be
> unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> interface for this. Getting fd-backed memory into a guest will take
> some possibly major work in the kernel, but getting vma-backed memory
> into a guest without mapping it in the host user address space seems
> much, much worse.

Well, in my view, MAP_EXCLUSIVE is intended for keeping small secrets
rather than for the entire guest memory. I even considered adding a limit
on the mapping size, but then I decided that since RLIMIT_MEMLOCK is
enforced anyway there is no need for a new one.
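
A minimal userspace sketch of the semantics described above, under stated
assumptions. MAP_EXCLUSIVE exists only in this RFC, not in mainline headers,
so the sketch applies it only if the build environment happens to define it.
Otherwise it falls back to the closest existing behaviour, a pre-populated
and locked anonymous private mapping, which does not remove the pages from
the direct map. The helper names memlock_limit(), secret_alloc() and
secret_free() are hypothetical. The RLIMIT_MEMLOCK pre-check is a
simplification: the kernel also accounts for memory that is already locked
and for page rounding.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS, MAP_LOCKED, MAP_POPULATE */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Soft RLIMIT_MEMLOCK in bytes; RLIM_INFINITY if unlimited or unknown. */
static rlim_t memlock_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return RLIM_INFINITY;
    return rl.rlim_cur;
}

/*
 * Hypothetical helper: map an anonymous, private, pre-populated and locked
 * region for a small secret. Returns NULL (with errno set) on failure.
 */
static void *secret_alloc(size_t len)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED | MAP_POPULATE;
    rlim_t limit = memlock_limit();
    void *p;

    if (len == 0) {
        errno = EINVAL;
        return NULL;
    }
    /* Simplified pre-check: refuse requests above the soft memlock limit. */
    if (limit != RLIM_INFINITY && len > limit) {
        errno = ENOMEM;
        return NULL;
    }
#ifdef MAP_EXCLUSIVE
    /* Assumption: only present on a kernel/libc carrying the RFC patch. */
    flags |= MAP_EXCLUSIVE;
#endif
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Wipe the region (best effort) and unmap it. */
static void secret_free(void *p, size_t len)
{
    int rc;

    memset(p, 0, len);
    rc = munmap(p, len);
    assert(rc == 0);
    (void)rc;
}
```

Because the flag is passed only at mmap() time, the sketch has no way to turn
an existing area exclusive or back, which matches the mmap()-only preference
stated above.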

I agree that getting fd-backed memory into a guest would be less painful
than VMA-backed memory, but KVM can already use memory outside the control
of the kernel via /dev/map [1].

So unless I'm missing something here, there is no need to use MAP_EXCLUSIVE
for the guest memory.

[1] https://lwn.net/Articles/778240/

> > Switching to a chardev doesn't solve the major problem of direct
> > map fragmentation and defeats the ability to use exclusive memory mappings
> > with the existing allocators, while mprotect() and madvise() do not.
> >
>
> Will people really want to do malloc() and then remap it exclusive?
> This sounds dubiously useful at best.

Again, my preference is to have mmap() only, but I see value in this use
case as well. Application developers allocate memory and then sometimes
change its properties rather than mmap() something new. For such usage
mprotect() may be useful.

-- 
Sincerely yours,
Mike.