Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp906630pxf; Wed, 7 Apr 2021 14:43:46 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzB4K+o9qBj5cqs9UtovPyTALC0UpX6SIiV+9wq/4BGo1+yAyn76zk4nDzK+yIILQwHstI6 X-Received: by 2002:a17:906:935a:: with SMTP id p26mr2787542ejw.521.1617831825989; Wed, 07 Apr 2021 14:43:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617831825; cv=none; d=google.com; s=arc-20160816; b=ThhwaGEgtL88OxzwyEb1yPjXr2aW5mbpRnqHn0pgttNj4F/GeGZJT1jd4plY7+NeHH FIM45Z39KXS1GGvAHTdwZQtpCti/KX0LdIgL4zP04vYjiXa1BvyK9C+GYymLv1vAFdmX UI0+iV4C8uzp8vf0qooT1trVqClN+o8LWLX4zVR+ij39QfCGvrq7RL91WMsjLrETs07b y4ZEFkWteBvr/C0044uJ0NjJhT9ujXDkDZd0dQI6opFB69DIrIFgMYIk4bsrDQNqW0h3 WSzNjyaHY2KTQmHCVkOKb2wDkwZjzRO6Bsmt0IgixpWtZgqcSNbXsKir1dw1wVAWhWkp iFIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=OTBUdK3dAn1sq4pMe+bs6yL8wERHeVvjCTaGLDzkJWA=; b=dFDpESHNQpPqbinB76sFJNdudTuRQ3CEVNFApUe4aLtgZA8/LSiUBlVZGRQnAaPxpz ARtxVZJy1MSZtA2pngHD/APTFy188zHT8gmZRf8vl75E5/ttbgGysmuSfimu47RaRd8X Au/Tyk1wij4AGfjWAh7Ozfw1pBjbd5iTKuc2S4q5vDJ6vNJQ6yxNCMMmSh74b+u00QiS w3yGz131Giu9je7UJVUL/v5wjRRRjsVCw9uSRpIa3JragAB5K7vNBF/ENNCCeR9aMjP8 vH4MfM5NBkFicxnE/7eHkqbHsW0Dosr65Tnz7GzqBSmzFy4GnXLo706/MMtbpT+dDIxp gIKA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id n9si6164189edo.256.2021.04.07.14.43.21; Wed, 07 Apr 2021 14:43:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235667AbhDGRIu (ORCPT + 99 others); Wed, 7 Apr 2021 13:08:50 -0400 Received: from raptor.unsafe.ru ([5.9.43.93]:46574 "EHLO raptor.unsafe.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234050AbhDGRIs (ORCPT ); Wed, 7 Apr 2021 13:08:48 -0400 Received: from comp-core-i7-2640m-0182e6.redhat.com (ip-94-113-225-162.net.upcbroadband.cz [94.113.225.162]) by raptor.unsafe.ru (Postfix) with ESMTPSA id E1C0040C8C; Wed, 7 Apr 2021 17:08:28 +0000 (UTC) From: Alexey Gladkov To: LKML , Kernel Hardening , Linux Containers , linux-mm@kvack.org Cc: Alexey Gladkov , Andrew Morton , Christian Brauner , "Eric W . Biederman" , Jann Horn , Jens Axboe , Kees Cook , Linus Torvalds , Oleg Nesterov Subject: [PATCH v10 0/9] Count rlimits in each user namespace Date: Wed, 7 Apr 2021 19:08:05 +0200 Message-Id: X-Mailer: git-send-email 2.29.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.6.4 (raptor.unsafe.ru [5.9.43.93]); Wed, 07 Apr 2021 17:08:31 +0000 (UTC) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Preface ------- These patches are for binding the rlimit counters to a user in user namespace. This patch set can be applied on top of: git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4 Problem ------- The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits implementation places the counters in user_struct [1]. These limits are global between processes and persists for the lifetime of the process, even if processes are in different user namespaces. To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never fork the service wants to set RLIMIT_NPROC=1. service-A \- program (uid=1000, container1, rlimit_nproc=1) \- program (uid=1000, container2, rlimit_nproc=1) The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails since user X already has one running process. The problem is not that the limit from container1 affects container2. The problem is that limit is verified against the global counter that reflects the number of processes in all containers. This problem can be worked around by using different users for each container but in this case we face a different problem of uid mapping when transferring files from one container to another. Eric W. Biederman mentioned this issue [2][3]. Introduced changes ------------------ To address the problem, we bind rlimit counters to user namespace. Each counter reflects the number of processes in a given uid in a given user namespace. The result is a tree of rlimit counters with the biggest value at the root (aka init_user_ns). The limit is considered exceeded if it's exceeded up in the tree. [1]: https://lore.kernel.org/containers/87imd2incs.fsf@x220.int.ebiederm.org/ [2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html [3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html Changelog --------- v10: * Fixed memory leak in __sigqueue_alloc. * Handled an unlikely situation when all consumers will return ucounts at once. * Addressed other review comments from Eric W. Biederman. v9: * Used a negative value to check that the ucounts->count is close to overflow. * Rebased onto v5.12-rc4. v8: * Used atomic_t for ucounts reference counting. Also added counter overflow check (thanks to Linus Torvalds for the idea). * Fixed other issues found by lkp-tests project in the patch that Reimplements RLIMIT_MEMLOCK on top of ucounts. v7: * Fixed issues found by lkp-tests project in the patch that Reimplements RLIMIT_MEMLOCK on top of ucounts. v6: * Fixed issues found by lkp-tests project. * Rebased onto v5.11. v5: * Split the first commit into two commits: change ucounts.count type to atomic_long_t and add ucounts to cred. These commits were merged by mistake during the rebase. * The __get_ucounts() renamed to alloc_ucounts(). * The cred.ucounts update has been moved from commit_creds() as it did not allow to handle errors. * Added error handling of set_cred_ucounts(). v4: * Reverted the type change of ucounts.count to refcount_t. * Fixed typo in the kernel/cred.c v3: * Added get_ucounts() function to increase the reference count. The existing get_counts() function renamed to __get_ucounts(). * The type of ucounts.count changed from atomic_t to refcount_t. * Dropped 'const' from set_cred_ucounts() arguments. * Fixed a bug with freeing the cred structure after calling cred_alloc_blank(). * Commit messages have been updated. * Added selftest. v2: * RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts. * Added ucounts for pair uid and user namespace into cred. * Added the ability to increase ucount by more than 1. v1: * After discussion with Eric W. Biederman, I increased the size of ucounts to atomic_long_t. * Added ucount_max to avoid the fork bomb. -- Alexey Gladkov (9): Increase size of ucounts to atomic_long_t Add a reference to ucounts for each cred Use atomic_t for ucounts reference counting Reimplement RLIMIT_NPROC on top of ucounts Reimplement RLIMIT_MSGQUEUE on top of ucounts Reimplement RLIMIT_SIGPENDING on top of ucounts Reimplement RLIMIT_MEMLOCK on top of ucounts kselftests: Add test to check for rlimit changes in different user namespaces ucounts: Set ucount_max to the largest positive value the type can hold fs/exec.c | 6 +- fs/hugetlbfs/inode.c | 16 +- fs/proc/array.c | 2 +- include/linux/cred.h | 4 + include/linux/hugetlb.h | 4 +- include/linux/mm.h | 4 +- include/linux/sched/user.h | 7 - include/linux/shmem_fs.h | 2 +- include/linux/signal_types.h | 4 +- include/linux/user_namespace.h | 32 +++- ipc/mqueue.c | 41 ++--- ipc/shm.c | 26 +-- kernel/cred.c | 50 +++++- kernel/exit.c | 2 +- kernel/fork.c | 18 +- kernel/signal.c | 58 +++---- kernel/sys.c | 14 +- kernel/ucount.c | 123 ++++++++++--- kernel/user.c | 3 - kernel/user_namespace.c | 9 +- mm/memfd.c | 4 +- mm/mlock.c | 23 ++- mm/mmap.c | 4 +- mm/shmem.c | 8 +- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/rlimits/.gitignore | 2 + tools/testing/selftests/rlimits/Makefile | 6 + tools/testing/selftests/rlimits/config | 1 + .../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++ 29 files changed, 490 insertions(+), 145 deletions(-) create mode 100644 tools/testing/selftests/rlimits/.gitignore create mode 100644 tools/testing/selftests/rlimits/Makefile create mode 100644 tools/testing/selftests/rlimits/config create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c -- 2.29.3