From: Jeff Xu
Date: Wed, 2 Aug 2023 13:45:21 -0700
Subject: Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec
To: Aleksa Sarai
Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-hardening@vger.kernel.org
References: <20230713143406.14342-1-cyphar@cyphar.com> <20230801.032503-medium.noises.extinct.omen-CStYZUqcNLCS@cyphar.com>
In-Reply-To: <20230801.032503-medium.noises.extinct.omen-CStYZUqcNLCS@cyphar.com>

> > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > > > >   because it will make it far too difficult to ever migrate.
> > > > >   Instead it should imply MFD_EXEC.
> > > > >
> > > > Though the purpose of memfd_noexec=2 is not to help with migration -
> > > > but to disable creation of executable memfd for the current system/pid
> > > > namespace.
> > > > During the migration, vm.memfd_noexec=1 helps overwriting for
> > > > unmigrated user code as a temporary measure.
> > > >
> > > My point is that the current behaviour for =2 means that nobody other
> > > than *maybe* ChromeOS will ever be able to use it because it requires
> > > auditing every program on the system. In fact, it's possible even
> > > ChromeOS will run into issues given that one of the arguments made for
> > > the nosymfollow mount option was that auditing all of ChromeOS to
> > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
> > > (which I agreed with). Maybe this is less of an issue with
> > > memfd_create(2) (which is much newer than open(2)) but it still seems
> > > like a lot of busy work when the =1 behaviour is entirely sane even in
> > > the strict threat model that =2 is trying to protect against.
> > >
> > It can also be a container (that has all memfd_create calls migrated to
> > the new API)
>
> If ChromeOS would struggle to rewrite all of the libraries they use,
> containers are in even worse shape -- most container users don't have a
> complete list of every package installed in a container, let alone the
> ability to audit whether they pass a (no-op) flag to memfd_create(2) in
> every codepath.
>
> > One option I considered previously was "=2" would do overwrite+block,
> > and "=3" just block. But then I worry that applications won't have
> > motivation to ever change their existing code, the setting will
> > forever stay at "=2", making "=3" even more impossible to ever be used
> > system-wide.
>
> What is the downside of overwriting? Backwards-compatibility is a very
> important part of Linux -- being able to use old programs without having
> to modify them is incredibly important.
> Yes, this behaviour is opt-in --
> but I don't see the point of making opting in more difficult than
> necessary. Surely overwrite+block provides the security guarantee you
> need from the threat model -- otherwise nobody will be able to use block
> because you never know if one library will call memfd_create()
> "incorrectly" without the new flags.
> >
> > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
> > > there are several tools for doing this (both seccomp and LSM hooks).
> > >
> > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/
> > >
> > > > Additional functionality/features should be implemented through
> > > > security hook and LSM, not sysctl, I think.
> > >
> > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you
> > > could implement either =2 behaviour with an LSM using =1, and the
> > > current strict =2 behaviour could be implemented purely with seccomp.)
> > >
> > By migration, I mean a system that is not fully migrated; such a
> > system should just use "=0" or "=1". Additional features can be
> > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if
> > a system wants to limit executable memfd to specific programs or fully
> > disable it.
> > "=2" is for a system/container that is fully migrated, in that case,
> > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
> > alternative.
> > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and
> > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is
> > essentially what you want, iiuc. However, I do not wish to have this
> > implemented in kernel, due to the thinking that I want kernel to get
> > out of business of "overwriting" eventually.
>
> See my above comments -- "overwriting" is perfectly acceptable to me.
> There's also no way to "get out of the business of overwriting" -- Linux
> has strict backwards compatibility requirements.
>

I agree, if we weigh the short-term goal of letting user-space
applications do the minimum, then having a 4-state sysctl (or 2 sysctls,
one controlling overwrite, one disabling/enabling executable memfd) will
do. But with that approach, I'm afraid that in one version of the future
(say in 20 years), most applications will stay with the old-style
memfd_create API, not setting the NX bit.

With the current approach, it might seem to be less convenient, but I
hope it offers a bit of incentive for applications to migrate their
code towards the new API, explicitly setting the NX bit. I understand
this hope is questionable, and we might still end up in the same place
in 20 years, but at least I tried :-).

I will leave this decision to the maintainers when you supply patches
for that, and I wouldn't feel bad either way; there are valid reasons on
both sides.

To supplement, there are two other ways to get what you want:
1> seccomp to block MFD_EXEC, leaving the setting at 1.
2> implement the blocking using a security hook and LSM, which imo is
probably the most common way to deal with this type of request (block
something).

I admit those two ways will be less convenient than just having sysctl
do all the things, from the user space's perspective.

Thanks
-Jeff
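
For readers following the thread, the policies being compared can be
sketched as a small Python model. This is only an illustration of the
behaviours as they are described in this discussion, not kernel code:
the function names are hypothetical, the exact errno returned by the
kernel is not modelled, and only the two flag values (taken from
include/uapi/linux/memfd.h) are real.

```python
# Illustrative model of the policies discussed in the thread.
# NOT kernel code: function names are hypothetical and the exact
# errno values used by the kernel are deliberately not modelled.

MFD_NOEXEC_SEAL = 0x0008  # value from include/uapi/linux/memfd.h
MFD_EXEC = 0x0010         # value from include/uapi/linux/memfd.h
EXEC_MASK = MFD_EXEC | MFD_NOEXEC_SEAL


def memfd_noexec_policy(level, flags):
    """Return the effective flags under vm.memfd_noexec=level (0, 1 or
    the strict 2 described in the thread), or raise if rejected."""
    if (flags & EXEC_MASK) == EXEC_MASK:
        # Asking for both executable and non-executable is invalid.
        raise ValueError("MFD_EXEC and MFD_NOEXEC_SEAL are exclusive")
    if level == 2:
        # Strict "block": every call that does not explicitly pass
        # MFD_NOEXEC_SEAL is rejected, including old-style calls.
        if not (flags & MFD_NOEXEC_SEAL):
            raise PermissionError("memfd_create rejected by policy")
        return flags
    if not (flags & EXEC_MASK):
        # "Overwriting": choose a default for old-style callers.
        flags |= MFD_NOEXEC_SEAL if level == 1 else MFD_EXEC
    return flags


def overwrite_and_block(flags):
    """Model "=1" plus a filter that rejects MFD_EXEC, i.e. the
    overwrite+block combination mentioned in the thread."""
    if flags & MFD_EXEC:
        raise PermissionError("executable memfd rejected by filter")
    return memfd_noexec_policy(1, flags)
```

In this model an old-style call (flags without either bit) is rejected
under the strict "=2", but is silently turned into a sealed,
non-executable memfd under overwrite+block, which is the difference the
two sides of the thread are weighing.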