From: Alexey Gladkov
To: LKML, io-uring@vger.kernel.org, Kernel Hardening, Linux Containers, linux-mm@kvack.org
Cc: Alexey Gladkov, Andrew Morton, Christian Brauner, "Eric W . Biederman", Jann Horn, Jens Axboe, Kees Cook, Linus Torvalds, Oleg Nesterov
Subject: [RFC PATCH v3 0/8] Count rlimits in each user namespace
Date: Fri, 15 Jan 2021 15:57:21 +0100
X-Mailer: git-send-email 2.29.2
X-Mailing-List: linux-kernel@vger.kernel.org

Preface
-------
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11-rc2

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE rlimit
implementations place their counters in user_struct [1]. These limits are global
across processes and persist for the lifetime of the process, even if the
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

Service-A sets RLIMIT_NPROC=1 and runs the program in container1. When
service-A tries to run the program with RLIMIT_NPROC=1 in container2, it fails
because user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using a different user for each container,
but then we face a different problem: uid mapping when transferring files from
one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to the user namespace. Each
counter reflects the number of processes for a given uid in a given user
namespace. The result is a tree of rlimit counters with the biggest value at
the root (aka init_user_ns). The limit is considered exceeded if it is exceeded
anywhere up the tree.
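To make the counter tree easier to picture, here is a minimal userspace
sketch. It is not the code from this series: struct ns_counter and both
helpers are hypothetical names, and the per-level "max" field is an assumed
simplification of how a limit is attached to each level. It only shows the
idea that a charge is applied at every level up to the root and is undone if
any level is over its limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one per-user counter; not the kernel's struct ucounts. */
struct ns_counter {
	struct ns_counter *parent; /* same user in the parent namespace, NULL at the root */
	long count;                /* objects charged at this level and below */
	long max;                  /* assumed limit enforced at this level */
};

/*
 * Charge one object to 'leaf' and to every ancestor up to the root.
 * If any level would go over its limit, undo the partial charge and fail.
 */
bool ns_counter_inc(struct ns_counter *leaf)
{
	struct ns_counter *iter, *undo;

	for (iter = leaf; iter != NULL; iter = iter->parent) {
		if (iter->count + 1 > iter->max)
			goto rollback;
		iter->count++;
	}
	return true;

rollback:
	/* Only the levels below 'iter' were incremented; revert exactly those. */
	for (undo = leaf; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release one object previously charged through 'leaf'. */
void ns_counter_dec(struct ns_counter *leaf)
{
	struct ns_counter *iter;

	for (iter = leaf; iter != NULL; iter = iter->parent) {
		assert(iter->count > 0);
		iter->count--;
	}
}
```

In this model, two containers that each have their own counter with max=1
under a root with a larger max can both charge one process, which is the
service-A scenario described above. A second charge inside either container
fails at that container's own level without touching the other one.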
[1] https://lore.kernel.org/containers/87imd2incs.fsf@x220.int.ebiederm.org/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for the (uid, user namespace) pair into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (8):
  Use refcount_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Move RLIMIT_NPROC counter to ucounts
  Move RLIMIT_MSGQUEUE counter to ucounts
  Move RLIMIT_SIGPENDING counter to ucounts
  Move RLIMIT_MEMLOCK counter to ucounts
  Move RLIMIT_NPROC check to the place where we increment the counter
  kselftests: Add test to check for rlimit changes in different user namespaces

 fs/exec.c                                  |   2 +-
 fs/hugetlbfs/inode.c                       |  17 +-
 fs/io-wq.c                                 |  22 ++-
 fs/io-wq.h                                 |   2 +-
 fs/io_uring.c                              |   2 +-
 fs/proc/array.c                            |   2 +-
 include/linux/cred.h                       |   3 +
 include/linux/hugetlb.h                    |   3 +-
 include/linux/mm.h                         |   4 +-
 include/linux/sched/user.h                 |   6 -
 include/linux/shmem_fs.h                   |   2 +-
 include/linux/signal_types.h               |   4 +-
 include/linux/user_namespace.h             |  31 +++-
 ipc/mqueue.c                               |  29 ++--
 ipc/shm.c                                  |  31 ++--
 kernel/cred.c                              |  46 ++++-
 kernel/exit.c                              |   2 +-
 kernel/fork.c                              |  12 +-
 kernel/signal.c                            |  53 +++---
 kernel/sys.c                               |  13 --
 kernel/ucount.c                            | 111 +++++++++---
 kernel/user.c                              |   2 -
 kernel/user_namespace.c                    |   7 +-
 mm/memfd.c                                 |   4 +-
 mm/mlock.c                                 |  35 ++--
 mm/mmap.c                                  |   3 +-
 mm/shmem.c                                 |   8 +-
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/rlimits/.gitignore |   2 +
 tools/testing/selftests/rlimits/Makefile   |   6 +
 tools/testing/selftests/rlimits/config     |   1 +
 .../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
 32 files changed, 445 insertions(+), 182 deletions(-)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

--
2.29.2