Date: Tue, 29 Oct 2019 10:19:19 +0100
From: Mike Rapoport
To: Dave Hansen
Cc: linux-kernel@vger.kernel.org, Alexey Dobriyan, Andrew Morton,
    Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
    James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
    Ingo Molnar, "H. Peter Anvin", linux-api@vger.kernel.org,
    linux-mm@kvack.org, x86@kernel.org, Mike Rapoport
Subject: Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
Message-ID: <20191029091918.GC18773@rapoport-lnx>
References: <1572171452-7958-1-git-send-email-rppt@kernel.org>
    <1572171452-7958-2-git-send-email-rppt@kernel.org>

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
>
> This looks fun. It's certainly simple.
>
> But, the description is not really calling out the pros and cons very
> well. I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise(). That's why protection keys ended up the way they did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this. If you do it mprotect()-style, you can
> apply it to existing allocations.

Actually, I started with mprotect() and then realized that mmap() would
be simpler, so I switched over to mmap().

> Some other random thoughts:
>
>  * The page flag is probably not a good idea. It would probably be
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.

The page flag won't work on 32-bit, indeed. But do we really need such
functionality on 32-bit?

>  * This really stops being "normal" memory. You can't do futexes on it,
>    can't splice it. Probably need a more fleshed-out list of
>    incompatible features.

True, my bad.
I should have mentioned more than THP/compaction/migration.

>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map. Not cool. This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive. They probably
>    border on a DoS on larger systems.

As I wrote in another email, we could use some kind of pooling to reduce
the "blast radius", and that would reduce the number of TLB flushes as
well. The size of a MAP_EXCLUSIVE mapping obeys RLIMIT_MEMLOCK, and we can
add a system-wide limit on the size of such allocations.

>  * Do we really want this user interface to dictate the kernel
>    implementation? In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET? One tells the kernel what to *do*, the
>    other tells the kernel what the memory *IS*.

I hesitated for quite some time between EXCLUSIVE and SECRET. I settled
on EXCLUSIVE because, in my view, it better describes the fact that the
region is only mapped in its owner's address space. As such it can be used
to store secrets, but it can be used for other purposes as well.

>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way. We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.

Agreed.

-- 
Sincerely yours,
Mike.