From: "Eric W. Biederman"
Cc: Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan,
    Christian Brauner, Solar Designer, Ran Xiaokai,
    linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
    Linux Containers, Michal Koutný, Neil Brown, NeilBrown,
    "Serge E. Hallyn", Kees Cook, Jann Horn
In-Reply-To: <87zgmi5rhm.fsf@email.froward.int.ebiederm.org> (Eric W. Biederman's message of "Tue, 22 Feb 2022 18:57:57 -0600")
References: <20220207121800.5079-1-mkoutny@suse.com> <20220215101150.GD21589@blackbody.suse.cz> <87zgmi5rhm.fsf@email.froward.int.ebiederm.org>
Date: Wed, 23 Feb 2022 12:00:16 -0600
Message-ID: <87fso91n0v.fsf_-_@email.froward.int.ebiederm.org>
Subject: How should rlimits, suid exec, and capabilities interact?
X-Mailing-List: linux-kernel@vger.kernel.org

[CC'd the security list because I really don't know who the right people
 are to drag into this discussion]

While looking at some issues that have cropped up with making it so
that RLIMIT_NPROC cannot be escaped by creating a user namespace, I have
stumbled upon a very old issue of how rlimits and suid exec interact
poorly.

This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two
bugs with process limits (RLIMIT_NPROC)") from
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which
essentially replaced a capable() check with an open-coded
implementation of suser(), for RLIMIT_NPROC.

The description from Neil Brown was:

    1/ If a setuid process swaps it's real and effective uids and then
    forks, the fork fails if the new realuid has more processes
    than the original process was limited to.
    This is particularly a problem if a user with a process limit
    (e.g. 256) runs a setuid-root program which does setuid() + fork()
    (e.g. lprng) while root already has more than 256 process (which
    is quite possible).

    The root problem here is that a limit which should be a per-user
    limit is being implemented as a per-process limit with
    per-process (e.g. CAP_SYS_RESOURCE) controls.
    Being a per-user limit, it should be that the root-user can over-ride
    it, not just some process with CAP_SYS_RESOURCE.

    This patch adds a test to ignore process limits if the real user is
    root.

The test to see if the real user is root was:
	if (p->real_cred->user != INIT_USER) ...
which persists to this day in fs/fork.c:copy_process().
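
To make the scenario in that description concrete, here is a minimal
userspace sketch of the fork-time check with and without the "real user
is root" exemption. It is a model under stated assumptions, not kernel
code: the struct and both function names are invented for illustration,
capabilities are left out entirely, and the per-user process count is
passed in by hand.

```c
#include <stdbool.h>

/*
 * Userspace model only. The struct and function names are made up for
 * this illustration and do not exist in the kernel; capabilities are
 * deliberately left out.
 */
struct model_task {
	unsigned int ruid;         /* real uid: the user the task is charged to */
	unsigned int euid;         /* effective uid: used for permission checks */
	unsigned long nproc_limit; /* RLIMIT_NPROC inherited from the caller */
};

/*
 * charged: number of processes already accounted to t->ruid.
 * Plain check: fork fails once the real user is at or over the limit.
 */
static bool model_fork_allowed_plain(const struct model_task *t,
				     unsigned long charged)
{
	return charged < t->nproc_limit;
}

/* Same check with the "real user is root" exemption described above. */
static bool model_fork_allowed_root_exempt(const struct model_task *t,
					   unsigned long charged)
{
	if (t->ruid == 0)
		return true;
	return charged < t->nproc_limit;
}
```

With a task that has swapped its uids (ruid 0, euid 1000) and inherited
a limit of 256 while root is charged with 300 processes, the plain check
refuses the fork and the exempting check allows it. For an ordinary
user (ruid 1000) the two checks agree.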

The practical problem with this test is that it works like nothing else
in the kernel, and so does not look like what it is. Saying:
	if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_UID)) ...
would at least be more recognizable.

Really this entire test should be
	if (!capable(CAP_SYS_RESOURCE)) ...
because CAP_SYS_RESOURCE is the capability that controls whether you
are allowed to exceed your rlimits.

Which brings us to the practical issues of how all of these things are
wired together today.

The per-user rlimits are accounted based upon a process's real user, not
the effective user. All other permission checks are based upon the
effective user. This has the practical effect that when uids are
swapped as above, the processes are charged to root but use the
permissions of an ordinary user.

The problems get worse when you realize that suid exec does not reset
any of the rlimits except for RLIMIT_STACK.

The rlimits that are particularly affected and are per-user are:
RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK.

But I think failing to reset rlimits during exec has the potential to
affect any suid exec.

Does anyone have any historical knowledge or sense of how this should
work?

Right now it feels like we have coded ourselves into a corner and will
have to risk breaking userspace to get out of it. In short, I think we
need a policy of resetting rlimits on suid exec, and I think we need to
store global rlimits based upon the effective user, not the real user.
Those changes should allow making capable() calls where they belong,
and removing the much too magic user == INIT_USER test for
RLIMIT_NPROC.

Eric