From: Andy Lutomirski
Date: Tue, 3 Jan 2023 10:36:01 -0800
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
To: Ingo Molnar
Cc: "Jason A. Donenfeld", linux-kernel@vger.kernel.org, patches@lists.linux.dev, tglx@linutronix.de, linux-crypto@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, Greg Kroah-Hartman, Adhemerval Zanella Netto, "Carlos O'Donell", Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner, linux-mm@kvack.org, Linus Torvalds
References: <20230101162910.710293-1-Jason@zx2c4.com> <20230101162910.710293-3-Jason@zx2c4.com>
X-Mailing-List: linux-crypto@vger.kernel.org

On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar wrote:
>
>
> * Jason A. Donenfeld wrote:
>
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> >
> > - It shouldn't be written to core dumps.
> >   * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> >   * Easy: VM_WIPEONFORK.

I have a rather different suggestion: make a special mapping. Jason,
you're trying to shoehorn all kinds of bizarre behavior into the core
mm, and none of that seems to me to belong to the core mm. Instead,
have an actual special mapping with callbacks that does the right
thing. No fancy VM flags.

Memory pressure: have it free and unmap itself. Gets accessed again?
->fault can handle it.

Want to mlock it? No, don't do that -- that's absurd. Just arrange
so that, if it gets evicted, it's not written out anywhere. And when
it gets faulted back in it does the right thing -- see above.

Zero on fork? I'm sure that's manageable with a special mapping. If
not, you can add a new vm operation or similar to make it work. (Kind
of like how we extended special mappings to get mremap right a couple
of years ago.) But maybe you don't want to *zero* it on fork and you
want to do something more intelligent. Fine -- you control ->fault!

> >
> > - It shouldn't be written to swap.
> >   * Uh-oh: mlock is rlimited.
> >   * Uh-oh: mlock isn't inherited by forks.

No mlock, no problems.

> >
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> >   page faulting in memory if none is available
> >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> >   * Uh-oh: VM_NORESERVE means segfaults.

->fault can do whatever you want.

And there is no shortage of user memory that *must* be made available
on fault in order to resume the faulting process. ->fault can handle
this.

> >
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> >
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> >    having the contents of the pages it reads zeroed out midway through
> >    the function's execution.
> >
> > 2) In the absolute worst case of whatever contingency we're coding for,
> >    we have the option to fallback to the getrandom() syscall, and
> >    everything is fine.
> >
> > 3) The buffers the function uses are only ever useful for a maximum of
> >    60 seconds -- a sort of cache, rather than a long term allocation.
> >
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:

No need for another vm flag.

> >
> > a) It never is written out to swap.

No need to complicate the swap logic for this.

> > b) Under memory pressure, mm can just drop the pages (so that they're
> >    zero when read back again).

Or ->fault could even repopulate it without needing to ever read zeros.

> > c) If there's not enough memory to service a page fault, it's not fatal,
> >    and no signal is sent. Instead, writes are simply lost.

This just seems massively overcomplicated to me. If there isn't
enough memory to fault in a page of code, we don't have some magic
instruction emulator in the kernel. We either OOM or we wait for
memory to show up.

> > d) It is inherited by fork.

If you have a special mapping and you fork, it doesn't magically turn
into normal memory.

> > e) It doesn't count against the mlock budget, since nothing is locked.

Special mapping -> no mlock.

> >
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> >
> > This way, allocations used by vDSO getrandom() can use:
> >
> >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> >
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> >
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.

I really dislike this. I'm with Ingo.
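
[Editorial illustration, not part of the original message.] The two
requirements the quoted patch description calls "easy" (VM_DONTDUMP and
VM_WIPEONFORK) can already be observed from userspace, since
madvise(MADV_DONTDUMP) and madvise(MADV_WIPEONFORK) set those flags on
an anonymous private mapping. The following is only a minimal sketch
under stated assumptions: it assumes Linux 4.14 or later, and the
function name child_sees_wiped_page() is made up for this example. It
does not show VM_DROPPABLE or the special-mapping ->fault approach
argued for above, both of which live in the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18 /* asm-generic value; a few architectures differ */
#endif

/*
 * Hypothetical helper (not from the patch series).
 * Returns  1 if a forked child saw the page zeroed,
 *          0 if the child still saw the parent's data,
 *         -1 if setup failed (for example, no MADV_WIPEONFORK support).
 */
int child_sees_wiped_page(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	assert(psz > 0);
	size_t len = (size_t)psz;

	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Stand-in for secret state that must not leak into a child. */
	memset(buf, 0xAA, len);

	/* Ask for zero-on-fork and for exclusion from core dumps. */
	if (madvise(buf, len, MADV_WIPEONFORK) != 0 ||
	    madvise(buf, len, MADV_DONTDUMP) != 0) {
		munmap(buf, len);
		return -1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: every byte of the page should read back as zero. */
		int wiped = 1;
		for (size_t i = 0; i < len; i++) {
			if (buf[i] != 0) {
				wiped = 0;
				break;
			}
		}
		_exit(wiped ? 0 : 1);
	}

	int status = 0;
	pid_t waited = waitpid(pid, &status, 0);

	/* The parent's copy must be untouched by the fork. */
	int parent_intact = (buf[0] == 0xAA && buf[len - 1] == 0xAA);
	munmap(buf, len);

	if (waited != pid || !WIFEXITED(status) || !parent_intact)
		return -1;
	return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Calling child_sees_wiped_page() from a small main() is enough to try
it. Note that mlock() is deliberately absent here: as the quoted text
says, it is rlimited and not inherited across fork, which is the gap
the thread is arguing about how to fill.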