Received: by 2002:a05:6a10:9e8c:0:0:0:0 with SMTP id y12csp330235pxx; Thu, 29 Oct 2020 03:50:42 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzmukHdSQOjwpDxHW2shzHRwbsqG3C3u7ZqrmS9NcL6KW3z2cgQnTW5DTlPCXvwss4RJDj2 X-Received: by 2002:a17:906:4155:: with SMTP id l21mr3574736ejk.204.1603968642314; Thu, 29 Oct 2020 03:50:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1603968642; cv=none; d=google.com; s=arc-20160816; b=gKQVmtKndbtEiyUdG7OOSvLkUrODujMndDRFEMb0sbOB1l08p56zmyxxh0WZD4Rn81 sl/9qYHmG6/zy/LhLixYN8LtCmsGivn/6YcAk5TuvCLBnR5LsCPahJmEhc9C3CbSM3ej cYkxSrlbWneJM9y3+mEX1NQ1ElrG8Ya9t8dJ0FrFTTcSuBkM9YcUGVwy1QddcQvS5HWa zId4xkju7FwHb4mCEoTRSMBIs3vfVtI0tP3TvCsmGvJaFU7NdoyGchpLh8Ht1cz5OolQ hA03EdUykDbKB0vorenG62fcls//kK2xq6MjFklDP1crX2A7ILiMfhJtXZdUGLw+ASOZ 7dug== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:from:references :cc:to:subject; bh=ZXiaW9XiVK/6fMZ3eSJRSReb8TQGHyjFAXNib2QPb5U=; b=1DB7dEBupDsbpeouO61LfzQQ/dNw1nYj8pFos5+uNH88EedC6tdlk3W7MB1nqYj1ft YYUZRS4xuEFi17vk9ozh1cX5V6v3XfZJxD7afxcQeCjxylQg8gziWkBs+CWHVRugkTVE wKKTHF/9iRQRZz8LUZUmNQQd6vvKMSK8G2N1CkI16A4YlxajZOSlQmSm7pPDaEBn0r7a Mfls+J15rApK+ZQIeB1vBaf7tFKwB+r0ynhxiBfqL1qLDMHBundCkFd5UXN0FbjppBU5 Y/4Q9EfoPpH25NZmSfewkj77N3ssLYf9Z4BGjULUoTqDCRhyGmzq+lYTtGIXHaf+cBV1 gAMg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id qc1si1667460ejb.315.2020.10.29.03.50.19; Thu, 29 Oct 2020 03:50:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726046AbgJ2Krr (ORCPT + 99 others); Thu, 29 Oct 2020 06:47:47 -0400 Received: from smtp-190d.mail.infomaniak.ch ([185.125.25.13]:58703 "EHLO smtp-190d.mail.infomaniak.ch" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726025AbgJ2Krq (ORCPT ); Thu, 29 Oct 2020 06:47:46 -0400 X-Greylist: delayed 139417 seconds by postgrey-1.27 at vger.kernel.org; Thu, 29 Oct 2020 06:47:44 EDT Received: from smtp-2-0000.mail.infomaniak.ch (unknown [10.5.36.107]) by smtp-2-3000.mail.infomaniak.ch (Postfix) with ESMTPS id 4CMMbM18Dqzlhss3; Thu, 29 Oct 2020 11:47:43 +0100 (CET) Received: from ns3096276.ip-94-23-54.eu (unknown [94.23.54.103]) by smtp-2-0000.mail.infomaniak.ch (Postfix) with ESMTPA id 4CMMbK0FgSzlh8TT; Thu, 29 Oct 2020 11:47:41 +0100 (CET) Subject: Re: [PATCH v22 07/12] landlock: Support filesystem access-control To: Jann Horn Cc: James Morris , "Serge E . Hallyn" , Al Viro , Andy Lutomirski , Anton Ivanov , Arnd Bergmann , Casey Schaufler , Jeff Dike , Jonathan Corbet , Kees Cook , Michael Kerrisk , Richard Weinberger , Shuah Khan , Vincent Dagonneau , Kernel Hardening , Linux API , linux-arch , "open list:DOCUMENTATION" , linux-fsdevel , kernel list , "open list:KERNEL SELFTEST FRAMEWORK" , linux-security-module , the arch/x86 maintainers , =?UTF-8?Q?Micka=c3=abl_Sala=c3=bcn?= References: <20201027200358.557003-1-mic@digikod.net> <20201027200358.557003-8-mic@digikod.net> From: =?UTF-8?Q?Micka=c3=abl_Sala=c3=bcn?= Message-ID: <341f2e5d-1d7f-e1d7-b982-0135dfc276e3@digikod.net> Date: Thu, 29 Oct 2020 11:47:40 +0100 User-Agent: MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 29/10/2020 02:06, Jann Horn wrote: > (On Tue, Oct 27, 2020 at 9:04 PM Mickaël Salaün wrote: >> Thanks to the Landlock objects and ruleset, it is possible to identify >> inodes according to a process's domain. To enable an unprivileged >> process to express a file hierarchy, it first needs to open a directory >> (or a file) and pass this file descriptor to the kernel through >> landlock_add_rule(2). When checking if a file access request is >> allowed, we walk from the requested dentry to the real root, following >> the different mount layers. The access to each "tagged" inodes are >> collected according to their rule layer level, and ANDed to create >> access to the requested file hierarchy. This makes possible to identify >> a lot of files without tagging every inodes nor modifying the >> filesystem, while still following the view and understanding the user >> has from the filesystem. >> >> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not >> keep the same struct inodes for the same inodes whereas these inodes are >> in use. >> >> This commit adds a minimal set of supported filesystem access-control >> which doesn't enable to restrict all file-related actions. This is the >> result of multiple discussions to minimize the code of Landlock to ease >> review. Thanks to the Landlock design, extending this access-control >> without breaking user space will not be a problem. Moreover, seccomp >> filters can be used to restrict the use of syscall families which may >> not be currently handled by Landlock. > [...] >> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h > [...] >> +/** >> + * DOC: fs_access >> + * >> + * A set of actions on kernel objects may be defined by an attribute (e.g. >> + * &struct landlock_path_beneath_attr) including a bitmask of access. >> + * >> + * Filesystem flags >> + * ~~~~~~~~~~~~~~~~ >> + * >> + * These flags enable to restrict a sandbox process to a set of actions on > > s/sandbox/sandboxed/ OK > > [...] >> diff --git a/security/landlock/fs.c b/security/landlock/fs.c > [...] >> +static const struct landlock_object_underops landlock_fs_underops = { >> + .release = release_inode >> +}; > [...] >> +/* Access-control management */ >> + >> +static bool check_access_path_continue( >> + const struct landlock_ruleset *const domain, >> + const struct path *const path, const u32 access_request, >> + bool *const allow, u64 *const layer_mask) >> +{ >> + const struct landlock_rule *rule; >> + const struct inode *inode; >> + bool next = true; >> + >> + prefetch(path->dentry->d_parent); > > IIRC software prefetch() turned out to only rarely actually have a > performance benefit, and they often actually make things worse; see > e.g. . Unless you have strong > evidence that this actually brings a performance benefit, I'd probably > get rid of this. I took inspiration from the fs/d_path.c:prepend_path() but I agree. I'll remove prefetch() calls in the next series. I'll add them later if a benchmark shows an interesting performance impact. > >> + if (d_is_negative(path->dentry)) >> + /* Continues to walk while there is no mapped inode. */ >> + return true; >> + inode = d_backing_inode(path->dentry); >> + rcu_read_lock(); >> + rule = landlock_find_rule(domain, >> + rcu_dereference(landlock_inode(inode)->object)); >> + rcu_read_unlock(); >> + >> + /* Checks for matching layers. */ >> + if (rule && (rule->layers | *layer_mask)) { >> + *allow = (rule->access & access_request) == access_request; >> + if (*allow) { >> + *layer_mask &= ~rule->layers; >> + /* Stops when a rule from each layer granted access. */ >> + next = !!*layer_mask; >> + } else { >> + next = false; >> + } >> + } >> + return next; >> +} >> + >> +static int check_access_path(const struct landlock_ruleset *const domain, >> + const struct path *const path, u32 access_request) >> +{ >> + bool allow = false; >> + struct path walker_path; >> + u64 layer_mask; >> + >> + if (WARN_ON_ONCE(!domain || !path)) >> + return 0; >> + /* >> + * Allows access to pseudo filesystems that will never be mountable >> + * (e.g. sockfs, pipefs), but can still be reachable through >> + * /proc/self/fd . >> + */ >> + if ((path->dentry->d_sb->s_flags & SB_NOUSER) || >> + (d_is_positive(path->dentry) && >> + unlikely(IS_PRIVATE(d_backing_inode(path->dentry))))) >> + return 0; >> + if (WARN_ON_ONCE(domain->nb_layers < 1)) >> + return -EACCES; >> + >> + layer_mask = GENMASK_ULL(domain->nb_layers - 1, 0); >> + /* >> + * An access request which is not handled by the domain should be >> + * allowed. >> + */ >> + access_request &= domain->fs_access_mask; >> + if (access_request == 0) >> + return 0; >> + walker_path = *path; >> + path_get(&walker_path); >> + /* >> + * We need to walk through all the hierarchy to not miss any relevant >> + * restriction. >> + */ >> + while (check_access_path_continue(domain, &walker_path, access_request, >> + &allow, &layer_mask)) { > > The logic in this code might be clearer if > check_access_path_continue() just returns whether the rule permitted > the access. Then it'd look like: > > bool allow = false; > [...] > while (check_access_path_continue(domain, &walker_path, > access_request, &layer_mask)) { > if (layer_mask == 0) { > allow = true; > break; > } > [...] > } > > I think that would make it clearer under which conditions we can end > up returning "true" from check_access_path(). > > (The current code also looks correct to me, I just think it'd be > clearer this way. If you disagree, you can keep it as-is.) I agree, applied and tested. > > >> + struct dentry *parent_dentry; >> + >> +jump_up: >> + /* >> + * Does not work with orphaned/private mounts like overlayfs >> + * layers for now (cf. ovl_path_real() and ovl_path_open()). >> + */ >> + if (walker_path.dentry == walker_path.mnt->mnt_root) { >> + if (follow_up(&walker_path)) { >> + /* Ignores hidden mount points. */ >> + goto jump_up; >> + } else { >> + /* >> + * Stops at the real root. Denies access >> + * because not all layers have granted access. >> + */ >> + allow = false; >> + break; >> + } >> + } >> + if (unlikely(IS_ROOT(walker_path.dentry))) { >> + /* >> + * Stops at disconnected root directories. Only allows >> + * access to internal filesystems (e.g. nsfs which is >> + * reachable through /proc/self/ns). >> + */ >> + allow = !!(walker_path.mnt->mnt_flags & MNT_INTERNAL); >> + break; >> + } >> + parent_dentry = dget_parent(walker_path.dentry); >> + dput(walker_path.dentry); >> + walker_path.dentry = parent_dentry; >> + } >> + path_put(&walker_path); >> + return allow ? 0 : -EACCES; >> +} > [...] >> +static inline u32 get_file_access(const struct file *const file) >> +{ >> + u32 access = 0; >> + >> + if (file->f_mode & FMODE_READ) { >> + /* A directory can only be opened in read mode. */ >> + if (S_ISDIR(file_inode(file)->i_mode)) >> + return LANDLOCK_ACCESS_FS_READ_DIR; >> + access = LANDLOCK_ACCESS_FS_READ_FILE; >> + } >> + /* >> + * A LANDLOCK_ACCESS_FS_APPEND could be added but we also need to check >> + * fcntl(2). >> + */ > > Once https://lore.kernel.org/linux-api/20200831153207.GO3265@brightrain.aerifal.cx/ > lands, pwritev2() with RWF_NOAPPEND will also be problematic for > classifying "write" vs "append"; you may want to include that in the > comment. (Or delete the comment.) Right, I'll include it in the comment. > >> + if (file->f_mode & FMODE_WRITE) >> + access |= LANDLOCK_ACCESS_FS_WRITE_FILE; >> + /* __FMODE_EXEC is indeed part of f_flags, not f_mode. */ >> + if (file->f_flags & __FMODE_EXEC) >> + access |= LANDLOCK_ACCESS_FS_EXECUTE; >> + return access; >> +} > [...] >