The patch expands subset= option. If the proc is mounted with the
subset=allowlist option, the /proc/allowlist file will appear. This file
contains the filenames and directories that are allowed for this
mountpoint. By default, /proc/allowlist contains only its own name.
Changing the allowlist is possible as long as it is present in the
allowlist itself.
This allowlist is applied in lookup/readdir so files that will create
modules after mounting will not be visible.
Compared to the previous patches [1][2], I switched to a special virtual
file from listing filenames in the mount options.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Alexey Gladkov <[email protected]>
---
Alexey Gladkov (6):
proc: Fix separator for subset option
proc: Add allowlist to control access to procfs files
proc: Check that subset= option has been set
proc: Allow to use the allowlist filter in userns
proc: Validate incoming allowlist
doc: proc: Add description of subset=allowlist
Documentation/filesystems/proc.rst | 10 +
fs/proc/Kconfig | 10 +
fs/proc/Makefile | 1 +
fs/proc/generic.c | 15 +-
fs/proc/inode.c | 16 +-
fs/proc/internal.h | 33 ++++
fs/proc/proc_allowlist.c | 300 +++++++++++++++++++++++++++++
fs/proc/root.c | 36 +++-
include/linux/proc_fs.h | 18 +-
9 files changed, 420 insertions(+), 19 deletions(-)
create mode 100644 fs/proc/proc_allowlist.c
--
2.33.6
If, after creating a container and mounting procfs, the system
configuration may change and new files may appear in procfs. Files
including writable root or any other users.
In most cases, it is known in advance which files in procfs the user
needs in the container. It is much easier to control the list of what
you want than to control the list of unwanted files.
To do this, subset=allowlist is added to control the visibility of
static files in procfs (not process pids). After that, the control file
/proc/allowlist appears in the root of the filesystem. This file
contains a list of files and directories that will be visible in this
vmountpoint. Immediately after mount, this file contains only one
name - the name of the file itself.
The admin can add names, read this file to get the current state of the
allowlist. The file behaves almost like a regular file. Changes are
applied when the file is closed.
To prevent changes to allowlist, admin should remove its name from the
list of allowed files. After this change, the file will disappear.
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/Kconfig | 10 ++
fs/proc/Makefile | 1 +
fs/proc/generic.c | 15 ++-
fs/proc/internal.h | 29 +++++
fs/proc/proc_allowlist.c | 221 +++++++++++++++++++++++++++++++++++++++
fs/proc/root.c | 27 ++++-
include/linux/proc_fs.h | 8 ++
7 files changed, 306 insertions(+), 5 deletions(-)
create mode 100644 fs/proc/proc_allowlist.c
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 32b1116ae137..bfe80b1fd31f 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -108,3 +108,13 @@ config PROC_PID_ARCH_STATUS
config PROC_CPU_RESCTRL
def_bool n
depends on PROC_FS
+
+config PROC_ALLOW_LIST
+ bool "/proc/allowlist support"
+ depends on PROC_FS
+ default n
+ help
+ Provides a way to restrict access to certain files in procfs. Mounting
+ procfs with subset=allowlist will add the file /proc/allowlist which
+ contains a list of files and directories that should be accessed. To
+ prevent the list from being changed, the file itself must be excluded.
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index bd08616ed8ba..3c7d3dacbd2f 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -34,3 +34,4 @@ proc-$(CONFIG_PROC_VMCORE) += vmcore.o
proc-$(CONFIG_PRINTK) += kmsg.o
proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o
proc-$(CONFIG_BOOT_CONFIG) += bootconfig.o
+proc-$(CONFIG_PROC_ALLOW_LIST) += proc_allowlist.o
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 587b91d9d998..d4c8589987e7 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -43,7 +43,7 @@ void pde_free(struct proc_dir_entry *pde)
kmem_cache_free(proc_dir_entry_cache, pde);
}
-static int proc_match(const char *name, struct proc_dir_entry *de, unsigned int len)
+int proc_match(const char *name, struct proc_dir_entry *de, unsigned int len)
{
if (len < de->namelen)
return -1;
@@ -251,6 +251,9 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
if (de) {
pde_get(de);
read_unlock(&proc_subdir_lock);
+ if (!proc_pde_access_allowed(proc_sb_info(dir->i_sb), de)) {
+ return ERR_PTR(-ENOENT);
+ }
inode = proc_get_inode(dir->i_sb, de);
if (!inode)
return ERR_PTR(-ENOMEM);
@@ -266,7 +269,7 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
{
struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
- if (fs_info->pidonly == PROC_PIDONLY_ON)
+ if (fs_info->pidonly == PROC_PIDONLY_ON && !proc_has_allowlist(fs_info))
return ERR_PTR(-ENOENT);
return proc_lookup_de(dir, dentry, PDE(dir));
@@ -284,6 +287,9 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
int proc_readdir_de(struct file *file, struct dir_context *ctx,
struct proc_dir_entry *de)
{
+ struct inode *inode = file_inode(file);
+ struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+
int i;
if (!dir_emit_dots(file, ctx))
@@ -307,7 +313,8 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
struct proc_dir_entry *next;
pde_get(de);
read_unlock(&proc_subdir_lock);
- if (!dir_emit(ctx, de->name, de->namelen,
+ if (proc_pde_access_allowed(fs_info, de) &&
+ !dir_emit(ctx, de->name, de->namelen,
de->low_ino, de->mode >> 12)) {
pde_put(de);
return 0;
@@ -327,7 +334,7 @@ int proc_readdir(struct file *file, struct dir_context *ctx)
struct inode *inode = file_inode(file);
struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
- if (fs_info->pidonly == PROC_PIDONLY_ON)
+ if (fs_info->pidonly == PROC_PIDONLY_ON && !proc_has_allowlist(fs_info))
return 1;
return proc_readdir_de(file, ctx, PDE(inode));
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index b701d0207edf..999d105f6f96 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -84,6 +84,16 @@ static inline void pde_make_permanent(struct proc_dir_entry *pde)
pde->flags |= PROC_ENTRY_PERMANENT;
}
+static inline bool pde_is_allowlist(const struct proc_dir_entry *pde)
+{
+ return pde->flags & PROC_ENTRY_ALLOWLIST;
+}
+
+static inline void pde_make_allowlist(struct proc_dir_entry *pde)
+{
+ pde->flags |= PROC_ENTRY_ALLOWLIST;
+}
+
extern struct kmem_cache *proc_dir_entry_cache;
void pde_free(struct proc_dir_entry *pde);
@@ -187,6 +197,7 @@ struct proc_dir_entry *proc_create_reg(const char *name, umode_t mode,
struct proc_dir_entry **parent, void *data);
struct proc_dir_entry *proc_register(struct proc_dir_entry *dir,
struct proc_dir_entry *dp);
+extern int proc_match(const char *, struct proc_dir_entry *, unsigned int);
extern struct dentry *proc_lookup(struct inode *, struct dentry *, unsigned int);
struct dentry *proc_lookup_de(struct inode *, struct dentry *, struct proc_dir_entry *);
extern int proc_readdir(struct file *, struct dir_context *);
@@ -318,3 +329,21 @@ static inline void pde_force_lookup(struct proc_dir_entry *pde)
/* /proc/net/ entries can be changed under us by setns(CLONE_NEWNET) */
pde->proc_dops = &proc_net_dentry_ops;
}
+
+/*
+ * proc_allowlist.c
+ */
+#ifdef CONFIG_PROC_ALLOW_LIST
+extern bool proc_has_allowlist(struct proc_fs_info *);
+extern bool proc_pde_access_allowed(struct proc_fs_info *, struct proc_dir_entry *);
+#else
+static inline bool proc_has_allowlist(struct proc_fs_info *fs_info)
+{
+ return false;
+}
+
+static inline bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *pde)
+{
+ return true;
+}
+#endif
diff --git a/fs/proc/proc_allowlist.c b/fs/proc/proc_allowlist.c
new file mode 100644
index 000000000000..b38e11b04199
--- /dev/null
+++ b/fs/proc/proc_allowlist.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/fs/proc/proc_allowlist.c
+ *
+ * Copyright (C) 2022
+ *
+ * Author: Alexey Gladkov <[email protected]>
+ */
+#include <linux/sizes.h>
+#include <linux/fs.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/rwlock.h>
+#include "internal.h"
+
+#define FILE_SEQFILE(f) ((struct seq_file *)((f)->private_data))
+#define FILE_DATA(f) (FILE_SEQFILE(f)->private)
+
+bool proc_has_allowlist(struct proc_fs_info *fs_info)
+{
+ bool ret;
+ unsigned long flags;
+
+ read_lock_irqsave(&fs_info->allowlist_lock, flags);
+ ret = (fs_info->allowlist == NULL);
+ read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+
+ return ret;
+}
+
+bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *de)
+{
+ bool ret = false;
+ char *ptr;
+ unsigned long flags;
+
+ read_lock_irqsave(&fs_info->allowlist_lock, flags);
+
+ if (!fs_info->allowlist) {
+ read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+
+ if (!pde_is_allowlist(de))
+ ret = true;
+
+ return ret;
+ }
+
+ ptr = fs_info->allowlist;
+
+ while (*ptr != '\0') {
+ struct proc_dir_entry *pde;
+ char *sep, *end;
+ size_t len, pathlen;
+
+ if (!(sep = strchr(ptr, '\n')))
+ pathlen = strlen(ptr);
+ else
+ pathlen = (sep - ptr);
+
+ if (!pathlen)
+ goto next;
+
+ pde = de;
+ end = NULL;
+ len = pathlen;
+
+ while (ptr != end && len > 0) {
+ end = ptr + len - 1;
+
+ while (1) {
+ if (*end == '/') {
+ end++;
+ break;
+ }
+ if (end == ptr)
+ break;
+ end--;
+ }
+
+ if (proc_match(end, pde, ptr + len - end))
+ goto next;
+
+ len = end - ptr - 1;
+ pde = pde->parent;
+ }
+
+ ret = true;
+ break;
+next:
+ ptr += pathlen + 1;
+ }
+
+ read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+
+ return ret;
+}
+
+static int show_allowlist(struct seq_file *m, void *v)
+{
+ struct proc_fs_info *fs_info = proc_sb_info(m->file->f_inode->i_sb);
+ char *p = fs_info->allowlist;
+ unsigned long flags;
+
+ read_lock_irqsave(&fs_info->allowlist_lock, flags);
+ if (p)
+ seq_puts(m, p);
+ read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+
+ return 0;
+}
+
+static int open_allowlist(struct inode *inode, struct file *file)
+{
+ struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ // we need this because shrink_dcache_sb() can't drop our own dentry.
+ if (!proc_pde_access_allowed(fs_info, PDE(inode)))
+ return -ENOENT;
+
+ // we want a null-terminated string so not all 128K are available.
+ ret = single_open_size(file, show_allowlist, NULL, SZ_128K);
+
+ if (!ret && (file->f_mode & FMODE_WRITE) &&
+ (file->f_flags & O_APPEND) && !(file->f_flags & O_TRUNC))
+ show_allowlist(FILE_SEQFILE(file), NULL);
+
+ return ret;
+}
+
+static ssize_t write_allowlist(struct file *file, const char __user *buffer, size_t count, loff_t *pos)
+{
+ struct seq_file *seq_file = FILE_SEQFILE(file);
+ ssize_t ret;
+ ssize_t n = count;
+ const char *ptr = buffer;
+
+ if ((seq_file->count + count) >= (seq_file->size - 1))
+ return -EFBIG;
+
+ while (n > 0) {
+ char chunk[SZ_256];
+ loff_t chkpos = 0;
+ ssize_t i, len;
+
+ len = simple_write_to_buffer(chunk, sizeof(chunk), &chkpos, ptr, n);
+ if (len < 0)
+ return len;
+
+ for (i = 0; i < len; i++) {
+ if (!isprint(chunk[i]) && chunk[i] != '\n')
+ return -EINVAL;
+ }
+
+ ret = seq_write(seq_file, chunk, len);
+ if (ret < 0)
+ return -EINVAL;
+
+ ptr += len;
+ n -= len;
+ }
+
+ if (pos)
+ *pos += count;
+
+ return count;
+}
+
+static int close_allowlist(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq_file = FILE_SEQFILE(file);
+ struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+
+ if (seq_file->buf && (file->f_mode & FMODE_WRITE)) {
+ char *buf;
+
+ if (!seq_get_buf(seq_file, &buf))
+ return -EIO;
+ *buf = '\0';
+
+ if (strcmp(seq_file->buf, fs_info->allowlist)) {
+ unsigned long flags;
+
+ buf = kstrndup(seq_file->buf, seq_file->count, GFP_KERNEL_ACCOUNT);
+ if (!buf)
+ return -EIO;
+
+ write_lock_irqsave(&fs_info->allowlist_lock, flags);
+
+ shrink_dcache_sb(inode->i_sb);
+
+ kfree(fs_info->allowlist);
+ fs_info->allowlist = buf;
+
+ write_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+ }
+ }
+
+ return single_release(inode, file);
+}
+
+static const struct proc_ops proc_allowlist_ops = {
+ .proc_open = open_allowlist,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = write_allowlist,
+ .proc_release = close_allowlist,
+};
+
+static int __init proc_allowlist_init(void)
+{
+ struct proc_dir_entry *pde;
+ pde = proc_create("allowlist", S_IRUSR | S_IWUSR, NULL, &proc_allowlist_ops);
+ pde_make_permanent(pde);
+ pde_make_allowlist(pde);
+ return 0;
+}
+fs_initcall(proc_allowlist_init);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5f1015b6418d..1564f5cd118d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -32,6 +32,7 @@ struct proc_fs_context {
enum proc_hidepid hidepid;
int gid;
enum proc_pidonly pidonly;
+ enum proc_allowlist allowlist;
};
enum proc_param {
@@ -99,6 +100,9 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
if (*value != '\0') {
if (!strcmp(value, "pid")) {
ctx->pidonly = PROC_PIDONLY_ON;
+ } else if (IS_ENABLED(CONFIG_PROC_ALLOW_LIST) &&
+ !strcmp(value, "allowlist")) {
+ ctx->allowlist = PROC_ALLOWLIST_ON;
} else {
return invalf(fc, "proc: unsupported subset option - %s\n", value);
}
@@ -142,6 +146,18 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
+static char *proc_init_allowlist(void)
+{
+ char *content = kstrdup("allowlist\n", GFP_KERNEL_ACCOUNT);
+
+ if (!content) {
+ pr_err("proc_init_allowlist: allocation allowlist failed\n");
+ return NULL;
+ }
+
+ return content;
+}
+
static void proc_apply_options(struct proc_fs_info *fs_info,
struct fs_context *fc,
struct user_namespace *user_ns)
@@ -152,8 +168,14 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
- if (ctx->mask & (1 << Opt_subset))
+ if (ctx->mask & (1 << Opt_subset)) {
fs_info->pidonly = ctx->pidonly;
+ if (ctx->allowlist == PROC_ALLOWLIST_ON) {
+ fs_info->allowlist = proc_init_allowlist();
+ } else {
+ fs_info->allowlist = NULL;
+ }
+ }
}
static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -167,6 +189,8 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
if (!fs_info)
return -ENOMEM;
+ rwlock_init(&fs_info->allowlist_lock);
+
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
proc_apply_options(fs_info, fc, current_user_ns());
@@ -271,6 +295,7 @@ static void proc_kill_sb(struct super_block *sb)
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
+ kfree(fs_info->allowlist);
kfree(fs_info);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 0260f5ea98fe..9105d75aeb18 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -24,6 +24,7 @@ enum {
#else
PROC_ENTRY_PERMANENT = 1U << 0,
#endif
+ PROC_ENTRY_ALLOWLIST = 1U << 1,
};
struct proc_ops {
@@ -58,6 +59,11 @@ enum proc_pidonly {
PROC_PIDONLY_ON = 1,
};
+enum proc_allowlist {
+ PROC_ALLOWLIST_OFF = 0,
+ PROC_ALLOWLIST_ON = 1,
+};
+
struct proc_fs_info {
struct pid_namespace *pid_ns;
struct dentry *proc_self; /* For /proc/self */
@@ -65,6 +71,8 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
+ char *allowlist;
+ rwlock_t allowlist_lock;
};
static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
--
2.33.6
Signed-off-by: Alexey Gladkov <[email protected]>
---
Documentation/filesystems/proc.rst | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e224b6d5b642..c2598bca8193 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2213,6 +2213,16 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
+subset=allowlist allows you to specify a list of files and directories to
+which you want to provide access. If the option is specified, then the
+/proc/allowlist will appear at the top level of the filesystem. By default, this
+file contains only its name. The user can add or remove other filenames and
+directories. To prohibit editing the allowlist, you need to exclude its name
+from the list of allowed ones.
+
+Different subset= option arguments can be combined using the plus(+) delimiter.
+For example: subset=pid+allowlist
+
Chapter 5: Filesystem behavior
==============================
--
2.33.6
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/root.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 3c2ee3eb1138..5f1015b6418d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -91,7 +91,7 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
struct proc_fs_context *ctx = fc->fs_private;
while (value) {
- char *ptr = strchr(value, ',');
+ char *ptr = strchr(value, '+');
if (ptr != NULL)
*ptr++ = '\0';
--
2.33.6
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/internal.h | 10 +++
fs/proc/proc_allowlist.c | 165 ++++++++++++++++++++++++++++++---------
fs/proc/root.c | 22 +-----
include/linux/proc_fs.h | 7 +-
4 files changed, 149 insertions(+), 55 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3e1b1f29b13d..2ca4e53a4b4b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -334,8 +334,18 @@ static inline void pde_force_lookup(struct proc_dir_entry *pde)
* proc_allowlist.c
*/
#ifdef CONFIG_PROC_ALLOW_LIST
+extern int proc_allowlist_append(struct list_head *, const char *, size_t);
+extern void proc_allowlist_free(struct list_head *);
extern bool proc_pde_access_allowed(struct proc_fs_info *, struct proc_dir_entry *);
#else
+static inline int proc_allowlist_append(struct list_head *, const char *, size_t)
+{
+ return 0;
+}
+static inline void proc_allowlist_free(struct list_head *)
+{
+ return;
+}
static inline bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *pde)
{
return true;
diff --git a/fs/proc/proc_allowlist.c b/fs/proc/proc_allowlist.c
index c605f73622bd..0115015c74f0 100644
--- a/fs/proc/proc_allowlist.c
+++ b/fs/proc/proc_allowlist.c
@@ -11,16 +11,56 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/rwlock.h>
+#include <linux/list_sort.h>
#include "internal.h"
#define FILE_SEQFILE(f) ((struct seq_file *)((f)->private_data))
#define FILE_DATA(f) (FILE_SEQFILE(f)->private)
+int proc_allowlist_append(struct list_head *allowlist, const char *path, size_t len)
+{
+ struct allowlist_entry *new;
+
+ if (!len)
+ return 0;
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
+ if (!new)
+ goto nomem;
+
+ new->path = kstrndup(path, len, GFP_KERNEL_ACCOUNT);
+ if (!new->path)
+ goto nomem;
+
+ INIT_LIST_HEAD(&new->list);
+ list_add_tail(&new->list, allowlist);
+
+ return 0;
+nomem:
+ if (new) {
+ kfree(new->path);
+ kfree(new);
+ }
+ return -ENOMEM;
+}
+
+void proc_allowlist_free(struct list_head *allowlist)
+{
+ struct list_head *el, *next;
+ struct allowlist_entry *entry;
+
+ list_for_each_safe(el, next, allowlist) {
+ entry = list_entry(el, struct allowlist_entry, list);
+ kfree(entry->path);
+ kfree(entry);
+ }
+}
+
bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *de)
{
bool ret = false;
- char *ptr;
unsigned long flags;
+ struct list_head *el, *next;
if (!(fs_info->subset & PROC_SUBSET_ALLOWLIST)) {
if (!pde_is_allowlist(de))
@@ -31,24 +71,13 @@ bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry
read_lock_irqsave(&fs_info->allowlist_lock, flags);
- ptr = fs_info->allowlist;
-
- while (ptr && *ptr != '\0') {
- struct proc_dir_entry *pde;
- char *sep, *end;
- size_t len, pathlen;
+ list_for_each_safe(el, next, &fs_info->allowlist) {
+ struct allowlist_entry *entry = list_entry(el, struct allowlist_entry, list);
- if (!(sep = strchr(ptr, '\n')))
- pathlen = strlen(ptr);
- else
- pathlen = (sep - ptr);
-
- if (!pathlen)
- goto next;
-
- pde = de;
- end = NULL;
- len = pathlen;
+ struct proc_dir_entry *pde = de;
+ char *end = NULL;
+ char *ptr = entry->path;
+ size_t len = strlen(entry->path);
while (ptr != end && len > 0) {
end = ptr + len - 1;
@@ -72,8 +101,7 @@ bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry
ret = true;
break;
-next:
- ptr += pathlen + 1;
+next: ;
}
read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
@@ -84,12 +112,18 @@ bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry
static int show_allowlist(struct seq_file *m, void *v)
{
struct proc_fs_info *fs_info = proc_sb_info(m->file->f_inode->i_sb);
- char *p = fs_info->allowlist;
unsigned long flags;
+ struct list_head *el, *next;
+ struct allowlist_entry *entry;
read_lock_irqsave(&fs_info->allowlist_lock, flags);
- if (p)
- seq_puts(m, p);
+
+ list_for_each_safe(el, next, &fs_info->allowlist) {
+ entry = list_entry(el, struct allowlist_entry, list);
+ seq_puts(m, entry->path);
+ seq_puts(m, "\n");
+ }
+
read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
return 0;
@@ -155,34 +189,93 @@ static ssize_t write_allowlist(struct file *file, const char __user *buffer, siz
return count;
}
+static int allowlist_cmp(void *priv, const struct list_head *a, const struct list_head *b)
+{
+ struct allowlist_entry *ia = list_entry(a, struct allowlist_entry, list);
+ struct allowlist_entry *ib = list_entry(b, struct allowlist_entry, list);
+
+ return strcmp(ia->path, ib->path);
+}
+
+static int recreate_allowlist(struct proc_fs_info *fs_info, const char *buf, size_t buflen)
+{
+ const char *ptr = buf;
+ size_t len = buflen;
+ size_t lineno = 1;
+ int ret = 0;
+ LIST_HEAD(allowlist);
+
+ while (len > 0) {
+ char *sep;
+ size_t pathlen;
+
+ if (!(sep = memchr(ptr, '\n', len)))
+ pathlen = buflen;
+ else
+ pathlen = (sep - ptr);
+
+ if (pathlen > 0) {
+ ret = -ENAMETOOLONG;
+ if (pathlen >= PATH_MAX) {
+ pr_crit("allowlist:%lu: pathname is too long\n", lineno);
+ goto err;
+ }
+
+ ret = -EINVAL;
+ if (*ptr == '/') {
+ pr_crit("allowlist:%lu: the name must be relative to the mount point\n", lineno);
+ goto err;
+ }
+ if (!isalpha(*ptr)) {
+ pr_crit("allowlist:%lu: name must start with a letter\n", lineno);
+ goto err;
+ }
+
+ proc_allowlist_append(&allowlist, ptr, pathlen);
+ }
+
+ ptr += pathlen + 1;
+ len -= pathlen + 1;
+
+ lineno++;
+ }
+
+ proc_allowlist_free(&fs_info->allowlist);
+ INIT_LIST_HEAD(&fs_info->allowlist);
+
+ if (!list_empty(&allowlist)) {
+ list_replace(&allowlist, &fs_info->allowlist);
+ list_sort(NULL, &fs_info->allowlist, allowlist_cmp);
+ }
+
+ return 0;
+err:
+ proc_allowlist_free(&allowlist);
+ return ret;
+}
+
static int close_allowlist(struct inode *inode, struct file *file)
{
struct seq_file *seq_file = FILE_SEQFILE(file);
struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
if (seq_file->buf && (file->f_mode & FMODE_WRITE)) {
+ unsigned long flags;
char *buf;
if (!seq_get_buf(seq_file, &buf))
return -EIO;
*buf = '\0';
- if (strcmp(seq_file->buf, fs_info->allowlist)) {
- unsigned long flags;
-
- buf = kstrndup(seq_file->buf, seq_file->count, GFP_KERNEL_ACCOUNT);
- if (!buf)
- return -EIO;
-
- write_lock_irqsave(&fs_info->allowlist_lock, flags);
-
- shrink_dcache_sb(inode->i_sb);
-
- kfree(fs_info->allowlist);
- fs_info->allowlist = buf;
+ write_lock_irqsave(&fs_info->allowlist_lock, flags);
+ if (recreate_allowlist(fs_info, seq_file->buf, seq_file->count) < 0) {
write_unlock_irqrestore(&fs_info->allowlist_lock, flags);
+ return -EIO;
}
+
+ shrink_dcache_sb(inode->i_sb);
+ write_unlock_irqrestore(&fs_info->allowlist_lock, flags);
}
return single_release(inode, file);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6e9b125072e5..18436d70bb12 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -147,18 +147,6 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static char *proc_init_allowlist(void)
-{
- char *content = kstrdup("allowlist\n", GFP_KERNEL_ACCOUNT);
-
- if (!content) {
- pr_err("proc_init_allowlist: allocation allowlist failed\n");
- return NULL;
- }
-
- return content;
-}
-
static void proc_apply_options(struct proc_fs_info *fs_info,
struct fs_context *fc,
struct user_namespace *user_ns)
@@ -171,11 +159,8 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
fs_info->subset = ctx->subset;
- if (ctx->subset & PROC_SUBSET_ALLOWLIST) {
- fs_info->allowlist = proc_init_allowlist();
- } else {
- fs_info->allowlist = NULL;
- }
+ if (ctx->subset & PROC_SUBSET_ALLOWLIST)
+ proc_allowlist_append(&fs_info->allowlist, "allowlist", 10);
}
}
@@ -191,6 +176,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return -ENOMEM;
rwlock_init(&fs_info->allowlist_lock);
+ INIT_LIST_HEAD(&fs_info->allowlist);
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
proc_apply_options(fs_info, fc, current_user_ns());
@@ -296,7 +282,7 @@ static void proc_kill_sb(struct super_block *sb)
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
- kfree(fs_info->allowlist);
+ proc_allowlist_free(&fs_info->allowlist);
kfree(fs_info);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 08d0d0ae6e42..81c6b4b2ae97 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -59,6 +59,11 @@ enum proc_subset {
PROC_SUBSET_ALLOWLIST = (1 << 2),
};
+struct allowlist_entry {
+ struct list_head list;
+ char *path;
+};
+
struct proc_fs_info {
struct pid_namespace *pid_ns;
struct dentry *proc_self; /* For /proc/self */
@@ -66,8 +71,8 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
unsigned int subset;
- char *allowlist;
rwlock_t allowlist_lock;
+ struct list_head allowlist;
};
static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
--
2.33.6
Refactor subset option. Before this option had only one value - pid. Now
another meaning has appeared and therefore their combinations are
possible.
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/generic.c | 4 ++--
fs/proc/inode.c | 16 +++++++++++++---
fs/proc/internal.h | 6 ------
fs/proc/proc_allowlist.c | 22 ++++------------------
fs/proc/root.c | 27 +++++++++++++++++++--------
include/linux/proc_fs.h | 15 +++++----------
6 files changed, 43 insertions(+), 47 deletions(-)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index d4c8589987e7..71a38b275814 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -269,7 +269,7 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
{
struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
- if (fs_info->pidonly == PROC_PIDONLY_ON && !proc_has_allowlist(fs_info))
+ if ((fs_info->subset & PROC_SUBSET_PIDONLY) && !(fs_info->subset & PROC_SUBSET_ALLOWLIST))
return ERR_PTR(-ENOENT);
return proc_lookup_de(dir, dentry, PDE(dir));
@@ -334,7 +334,7 @@ int proc_readdir(struct file *file, struct dir_context *ctx)
struct inode *inode = file_inode(file);
struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
- if (fs_info->pidonly == PROC_PIDONLY_ON && !proc_has_allowlist(fs_info))
+ if ((fs_info->subset & PROC_SUBSET_PIDONLY) && !(fs_info->subset & PROC_SUBSET_ALLOWLIST))
return 1;
return proc_readdir_de(file, ctx, PDE(inode));
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index f495fdb39151..4c486237a16b 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -184,9 +184,19 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
if (fs_info->hide_pid != HIDEPID_OFF)
seq_printf(seq, ",hidepid=%s", hidepid2str(fs_info->hide_pid));
- if (fs_info->pidonly != PROC_PIDONLY_OFF)
- seq_printf(seq, ",subset=pid");
-
+ if (fs_info->subset & PROC_SUBSET_SET) {
+ bool need_delim = false;
+ seq_printf(seq, ",subset=");
+ if (fs_info->subset & PROC_SUBSET_PIDONLY) {
+ seq_printf(seq, "pid");
+ need_delim = true;
+ }
+ if (fs_info->subset & PROC_SUBSET_ALLOWLIST) {
+ if (need_delim)
+ seq_printf(seq, "+");
+ seq_printf(seq, "allowlist");
+ }
+ }
return 0;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 999d105f6f96..3e1b1f29b13d 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -334,14 +334,8 @@ static inline void pde_force_lookup(struct proc_dir_entry *pde)
* proc_allowlist.c
*/
#ifdef CONFIG_PROC_ALLOW_LIST
-extern bool proc_has_allowlist(struct proc_fs_info *);
extern bool proc_pde_access_allowed(struct proc_fs_info *, struct proc_dir_entry *);
#else
-static inline bool proc_has_allowlist(struct proc_fs_info *fs_info)
-{
- return false;
-}
-
static inline bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *pde)
{
return true;
diff --git a/fs/proc/proc_allowlist.c b/fs/proc/proc_allowlist.c
index b38e11b04199..2153acb8e467 100644
--- a/fs/proc/proc_allowlist.c
+++ b/fs/proc/proc_allowlist.c
@@ -16,38 +16,24 @@
#define FILE_SEQFILE(f) ((struct seq_file *)((f)->private_data))
#define FILE_DATA(f) (FILE_SEQFILE(f)->private)
-bool proc_has_allowlist(struct proc_fs_info *fs_info)
-{
- bool ret;
- unsigned long flags;
-
- read_lock_irqsave(&fs_info->allowlist_lock, flags);
- ret = (fs_info->allowlist == NULL);
- read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
-
- return ret;
-}
-
bool proc_pde_access_allowed(struct proc_fs_info *fs_info, struct proc_dir_entry *de)
{
bool ret = false;
char *ptr;
unsigned long flags;
- read_lock_irqsave(&fs_info->allowlist_lock, flags);
-
- if (!fs_info->allowlist) {
- read_unlock_irqrestore(&fs_info->allowlist_lock, flags);
-
+ if (!(fs_info->subset & PROC_SUBSET_ALLOWLIST)) {
if (!pde_is_allowlist(de))
ret = true;
return ret;
}
+ read_lock_irqsave(&fs_info->allowlist_lock, flags);
+
ptr = fs_info->allowlist;
- while (*ptr != '\0') {
+ while (ptr && *ptr != '\0') {
struct proc_dir_entry *pde;
char *sep, *end;
size_t len, pathlen;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 1564f5cd118d..6e9b125072e5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -31,8 +31,7 @@ struct proc_fs_context {
unsigned int mask;
enum proc_hidepid hidepid;
int gid;
- enum proc_pidonly pidonly;
- enum proc_allowlist allowlist;
+ unsigned int subset;
};
enum proc_param {
@@ -91,6 +90,8 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
{
struct proc_fs_context *ctx = fc->fs_private;
+ ctx->subset |= PROC_SUBSET_SET;
+
while (value) {
char *ptr = strchr(value, '+');
@@ -99,10 +100,10 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
if (*value != '\0') {
if (!strcmp(value, "pid")) {
- ctx->pidonly = PROC_PIDONLY_ON;
+ ctx->subset |= PROC_SUBSET_PIDONLY;
} else if (IS_ENABLED(CONFIG_PROC_ALLOW_LIST) &&
!strcmp(value, "allowlist")) {
- ctx->allowlist = PROC_ALLOWLIST_ON;
+ ctx->subset |= PROC_SUBSET_ALLOWLIST;
} else {
return invalf(fc, "proc: unsupported subset option - %s\n", value);
}
@@ -169,8 +170,8 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
- fs_info->pidonly = ctx->pidonly;
- if (ctx->allowlist == PROC_ALLOWLIST_ON) {
+ fs_info->subset = ctx->subset;
+ if (ctx->subset & PROC_SUBSET_ALLOWLIST) {
fs_info->allowlist = proc_init_allowlist();
} else {
fs_info->allowlist = NULL;
@@ -346,14 +347,21 @@ static int proc_root_getattr(struct user_namespace *mnt_userns,
static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
{
- if (!proc_pid_lookup(dentry, flags))
- return NULL;
+ struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
+
+ if (!(fs_info->subset & PROC_SUBSET_SET) || (fs_info->subset & PROC_SUBSET_PIDONLY)) {
+ if (!proc_pid_lookup(dentry, flags))
+ return NULL;
+ }
return proc_lookup(dir, dentry, flags);
}
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
+ struct inode *inode = file_inode(file);
+ struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+
if (ctx->pos < FIRST_PROCESS_ENTRY) {
int error = proc_readdir(file, ctx);
if (unlikely(error <= 0))
@@ -361,6 +369,9 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
ctx->pos = FIRST_PROCESS_ENTRY;
}
+ if ((fs_info->subset & PROC_SUBSET_SET) && !(fs_info->subset & PROC_SUBSET_PIDONLY))
+ return 1;
+
return proc_pid_readdir(file, ctx);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 9105d75aeb18..08d0d0ae6e42 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -53,15 +53,10 @@ enum proc_hidepid {
HIDEPID_NOT_PTRACEABLE = 4, /* Limit pids to only ptraceable pids */
};
-/* definitions for proc mount option pidonly */
-enum proc_pidonly {
- PROC_PIDONLY_OFF = 0,
- PROC_PIDONLY_ON = 1,
-};
-
-enum proc_allowlist {
- PROC_ALLOWLIST_OFF = 0,
- PROC_ALLOWLIST_ON = 1,
+enum proc_subset {
+ PROC_SUBSET_SET = (1 << 0),
+ PROC_SUBSET_PIDONLY = (1 << 1),
+ PROC_SUBSET_ALLOWLIST = (1 << 2),
};
struct proc_fs_info {
@@ -70,7 +65,7 @@ struct proc_fs_info {
struct dentry *proc_thread_self; /* For /proc/thread-self */
kgid_t pid_gid;
enum proc_hidepid hide_pid;
- enum proc_pidonly pidonly;
+ unsigned int subset;
char *allowlist;
rwlock_t allowlist_lock;
};
--
2.33.6
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/proc_allowlist.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/proc/proc_allowlist.c b/fs/proc/proc_allowlist.c
index 2153acb8e467..c605f73622bd 100644
--- a/fs/proc/proc_allowlist.c
+++ b/fs/proc/proc_allowlist.c
@@ -100,7 +100,7 @@ static int open_allowlist(struct inode *inode, struct file *file)
struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
int ret;
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;
// we need this because shrink_dcache_sb() can't drop our own dentry.
@@ -199,7 +199,7 @@ static const struct proc_ops proc_allowlist_ops = {
static int __init proc_allowlist_init(void)
{
struct proc_dir_entry *pde;
- pde = proc_create("allowlist", S_IRUSR | S_IWUSR, NULL, &proc_allowlist_ops);
+ pde = proc_create("allowlist", S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH, NULL, &proc_allowlist_ops);
pde_make_permanent(pde);
pde_make_allowlist(pde);
return 0;
--
2.33.6
On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <[email protected]> wrote:
> The patch expands subset= option. If the proc is mounted with the
> subset=allowlist option, the /proc/allowlist file will appear. This file
> contains the filenames and directories that are allowed for this
> mountpoint. By default, /proc/allowlist contains only its own name.
> Changing the allowlist is possible as long as it is present in the
> allowlist itself.
>
> This allowlist is applied in lookup/readdir so files that will create
> modules after mounting will not be visible.
>
> Compared to the previous patches [1][2], I switched to a special virtual
> file from listing filenames in the mount options.
>
Changlog doesn't explain why you think Linux needs this feature. The
[2/6] changelog hints that containers might be involved. IOW, please
fully describe the requirement and use-case(s).
Also, please describe why /proc/allowlist is made available via a mount
option, rather than being permanently present.
And why add to subset=, instead of a separate mount option.
Does /proc/allowlist work in subdirectories? Like, permit presence of
/proc/sys/vm/compact_memory?
I think the whole thing is misnamed, really. "allowlist" implies
access permissions. Some of the test here uses "visibility" and other
places use "presence", which are better. "presentlist" and
/proc/presentlist might be better. But why not simply /proc/contents?
Please run these patches through checkpatch and consider the result.
On Wed, 25 Jan 2023 16:28:49 +0100 Alexey Gladkov <[email protected]> wrote:
> +config PROC_ALLOW_LIST
> + bool "/proc/allowlist support"
> + depends on PROC_FS
> + default n
> + help
> + Provides a way to restrict access to certain files in procfs. Mounting
I'd say "to restrict presence of files in procfs".
> + procfs with subset=allowlist will add the file /proc/allowlist which
> + contains a list of files and directories that should be accessed. To
s/accessed/present/
> + prevent the list from being changed, the file itself must be excluded.
On Wed, 25 Jan 2023 16:28:49 +0100 Alexey Gladkov <[email protected]> wrote:
> If, after creating a container and mounting procfs, the system
> configuration may change and new files may appear in procfs. Files
> including writable root or any other users.
>
> In most cases, it is known in advance which files in procfs the user
> needs in the container. It is much easier to control the list of what
> you want than to control the list of unwanted files.
>
> To do this, subset=allowlist is added to control the visibility of
> static files in procfs (not process pids). After that, the control file
> /proc/allowlist appears in the root of the filesystem. This file
> contains a list of files and directories that will be visible in this
> vmountpoint. Immediately after mount, this file contains only one
> name - the name of the file itself.
>
> The admin can add names, read this file to get the current state of the
> allowlist. The file behaves almost like a regular file. Changes are
> applied when the file is closed.
"the admin" is a bit vague. Precisely which capabilities are required
for this?
> To prevent changes to allowlist, admin should remove its name from the
> list of allowed files. After this change, the file will disappear.
On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <[email protected]> wrote:
>
> > The patch expands subset= option. If the proc is mounted with the
> > subset=allowlist option, the /proc/allowlist file will appear. This file
> > contains the filenames and directories that are allowed for this
> > mountpoint. By default, /proc/allowlist contains only its own name.
> > Changing the allowlist is possible as long as it is present in the
> > allowlist itself.
> >
> > This allowlist is applied in lookup/readdir so files that will create
> > modules after mounting will not be visible.
> >
> > Compared to the previous patches [1][2], I switched to a special virtual
> > file from listing filenames in the mount options.
> >
>
> Changlog doesn't explain why you think Linux needs this feature. The
> [2/6] changelog hints that containers might be involved. IOW, please
> fully describe the requirement and use-case(s).
>
> Also, please describe why /proc/allowlist is made available via a mount
> option, rather than being permanently present.
>
> And why add to subset=, instead of a separate mount option.
>
> Does /proc/allowlist work in subdirectories? Like, permit presence of
> /proc/sys/vm/compact_memory?
>
> I think the whole thing is misnamed, really. "allowlist" implies
> access permissions. Some of the test here uses "visibility" and other
> places use "presence", which are better. "presentlist" and
> /proc/presentlist might be better. But why not simply /proc/contents?
Currently, a lot of container runtimes - even if they mount a new procfs
instance - overmount various procfs files and directories to ensure that
they're hidden from the container workload. (The motivations for this
are mixed and usually it's only needed for containers that run with the
same privilege level as the host.)
The consequence of overmounting is that we need to refuse mounting
procfs again somewhere else otherwise the procfs instance might reveal
files and directories that were supposed to be hidden.
So this patchset moves the ability to hide entries into the kernel
through an allowlist. This way you can hide files and directories while
being able to mount procfs again because it will inherit the same
allowlist.
I get the motivation. The question is whether this belongs into the
kernel at all. I'm unfortunately not convinced.
This adds a lot of string parsing to procfs and I think we would also
need to decide what a reasonable maximum limit for such allowlists would
be. The data structure likely shouldn't be a linked list but at least an
rbtree especially if the size isn't limited.
But fundamentally I think it moves something that should be and
currently is a userspace policy into the kernel which I think is wrong.
Sure you can't predict what files show up in procfs over time but then
subset=pid is already your friend - even if not as fine-grained.
If this where another simple subset style mount option that allowlists a
bunch of well-known global proc files then sure. But making this
dynamically configurable from userspace doesn't make sense to me. I
mean, users could write /gobble/dy/gook into /proc/allowlist or use it
to stash secrets or hashes or whatever as we have no way of figuring out
whether the entry they allowlist does or will actually ever exist.
In general, such flexibility belongs into userspace imho.
Frankly, if that is really required it would almost make more sense to
be able to attach a new bpf program type to procfs that would allow to
filter procfs entries. Then the filter could be done purely in
userspace. If signed bpf lands one could then even ship signed programs
that are attachable by userns root.
On Wed, Jan 25, 2023 at 03:36:42PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:28:49 +0100 Alexey Gladkov <[email protected]> wrote:
>
> > If, after creating a container and mounting procfs, the system
> > configuration may change and new files may appear in procfs. Files
> > including writable root or any other users.
> >
> > In most cases, it is known in advance which files in procfs the user
> > needs in the container. It is much easier to control the list of what
> > you want than to control the list of unwanted files.
> >
> > To do this, subset=allowlist is added to control the visibility of
> > static files in procfs (not process pids). After that, the control file
> > /proc/allowlist appears in the root of the filesystem. This file
> > contains a list of files and directories that will be visible in this
> > vmountpoint. Immediately after mount, this file contains only one
> > name - the name of the file itself.
> >
> > The admin can add names, read this file to get the current state of the
> > allowlist. The file behaves almost like a regular file. Changes are
> > applied when the file is closed.
>
> "the admin" is a bit vague. Precisely which capabilities are required
> for this?
In this commit, the admin is the one with CAP_SYS_ADMIN. You're right,
I'll fix the commit message.
> > To prevent changes to allowlist, admin should remove its name from the
> > list of allowed files. After this change, the file will disappear.
>
--
Rgrds, legion
On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <[email protected]> wrote:
>
> > The patch expands subset= option. If the proc is mounted with the
> > subset=allowlist option, the /proc/allowlist file will appear. This file
> > contains the filenames and directories that are allowed for this
> > mountpoint. By default, /proc/allowlist contains only its own name.
> > Changing the allowlist is possible as long as it is present in the
> > allowlist itself.
> >
> > This allowlist is applied in lookup/readdir so files that will create
> > modules after mounting will not be visible.
> >
> > Compared to the previous patches [1][2], I switched to a special virtual
> > file from listing filenames in the mount options.
> >
>
> Changlog doesn't explain why you think Linux needs this feature. The
> [2/6] changelog hints that containers might be involved. IOW, please
> fully describe the requirement and use-case(s).
Ok. I will.
Basically, as Christian described, the motivation is to give
containerization programs (docker, podman, etc.) a way to control the
content in procfs.
Now container tools use a list of dangerous files that they hide with
overmount. But procfs is not a static filesystem and using a bad list to
hide dangerous files can't be the solution.
I believe that a container should define a list of files that it considers
useful within the container, and not try to hide what it considers
unwanted.
> Also, please describe why /proc/allowlist is made available via a mount
> option, rather than being permanently present.
Like subset=pid, this file is needed to change the visibility of files in
the procfs mountpoint.
> And why add to subset=, instead of a separate mount option.
>
> Does /proc/allowlist work in subdirectories? Like, permit presence of
> /proc/sys/vm/compact_memory?
Yes. But /proc/allowlist is limited in size to 128K.
> I think the whole thing is misnamed, really. "allowlist" implies
> access permissions. Some of the test here uses "visibility" and other
> places use "presence", which are better. "presentlist" and
> /proc/presentlist might be better. But why not simply /proc/contents?
I don't hold on to the name allowlist at all :) present list is perfect
for me. The /proc/contents is confusing to me.
> Please run these patches through checkpatch and consider the result.
Ok. I will.
--
Rgrds, legion
On Thu, Jan 26, 2023 at 11:16:07AM +0100, Christian Brauner wrote:
> On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> > On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <[email protected]> wrote:
> >
> > > The patch expands subset= option. If the proc is mounted with the
> > > subset=allowlist option, the /proc/allowlist file will appear. This file
> > > contains the filenames and directories that are allowed for this
> > > mountpoint. By default, /proc/allowlist contains only its own name.
> > > Changing the allowlist is possible as long as it is present in the
> > > allowlist itself.
> > >
> > > This allowlist is applied in lookup/readdir so files that will create
> > > modules after mounting will not be visible.
> > >
> > > Compared to the previous patches [1][2], I switched to a special virtual
> > > file from listing filenames in the mount options.
> > >
> >
> > Changlog doesn't explain why you think Linux needs this feature. The
> > [2/6] changelog hints that containers might be involved. IOW, please
> > fully describe the requirement and use-case(s).
> >
> > Also, please describe why /proc/allowlist is made available via a mount
> > option, rather than being permanently present.
> >
> > And why add to subset=, instead of a separate mount option.
> >
> > Does /proc/allowlist work in subdirectories? Like, permit presence of
> > /proc/sys/vm/compact_memory?
> >
> > I think the whole thing is misnamed, really. "allowlist" implies
> > access permissions. Some of the test here uses "visibility" and other
> > places use "presence", which are better. "presentlist" and
> > /proc/presentlist might be better. But why not simply /proc/contents?
>
> Currently, a lot of container runtimes - even if they mount a new procfs
> instance - overmount various procfs files and directories to ensure that
> they're hidden from the container workload. (The motivations for this
> are mixed and usually it's only needed for containers that run with the
> same privilege level as the host.)
>
> The consequence of overmounting is that we need to refuse mounting
> procfs again somewhere else otherwise the procfs instance might reveal
> files and directories that were supposed to be hidden.
>
> So this patchset moves the ability to hide entries into the kernel
> through an allowlist. This way you can hide files and directories while
> being able to mount procfs again because it will inherit the same
> allowlist.
>
> I get the motivation. The question is whether this belongs into the
> kernel at all. I'm unfortunately not convinced.
>
> This adds a lot of string parsing to procfs and I think we would also
> need to decide what a reasonable maximum limit for such allowlists would
> be.> The data structure likely shouldn't be a linked list but at least an
> rbtree especially if the size isn't limited.
There is a limit. So far I've limited the file size to 128k. I think this
is a reasonable limit.
> But fundamentally I think it moves something that should be and
> currently is a userspace policy into the kernel which I think is wrong.
We don't have mechanisms to implement this userspace policy. overmount is
not a solution but plugging holes in the absence of other ways to control
the visibility of files in procfs.
> Sure you can't predict what files show up in procfs over time but then
> subset=pid is already your friend - even if not as fine-grained.
>
> If this where another simple subset style mount option that allowlists a
> bunch of well-known global proc files then sure. But making this
> dynamically configurable from userspace doesn't make sense to me. I
> mean, users could write /gobble/dy/gook into /proc/allowlist or use it
> to stash secrets or hashes or whatever as we have no way of figuring out
> whether the entry they allowlist does or will actually ever exist.
BTW I only allow printable data to be written to the file.
We can make this file write-only and then writing any extraneous data
there will not make sense.
> In general, such flexibility belongs into userspace imho.
>
> Frankly, if that is really required it would almost make more sense to
> be able to attach a new bpf program type to procfs that would allow to
> filter procfs entries. Then the filter could be done purely in
> userspace. If signed bpf lands one could then even ship signed programs
> that are attachable by userns root.
I'll ask the podman developers how much more comfortable they would be
using bpf to control file visibility in procfs. thanks for the idea.
--
Rgrds, legion
On Thu, Jan 26, 2023 at 02:39:30PM +0100, Alexey Gladkov wrote:
> > In general, such flexibility belongs into userspace imho.
> >
> > Frankly, if that is really required it would almost make more sense to
> > be able to attach a new bpf program type to procfs that would allow to
> > filter procfs entries. Then the filter could be done purely in
> > userspace. If signed bpf lands one could then even ship signed programs
> > that are attachable by userns root.
>
> I'll ask the podman developers how much more comfortable they would be
> using bpf to control file visibility in procfs. thanks for the idea.
I write for history.
After digging into eBPF, I came to the conclusion that nothing needs to be
done in kernel space. Access can be controlled via "lsm/file_open". Access
can be controlled per cgroup or per mountpoint, depending on the task.
Each project has its own choice.
Many thanks for pointing out eBPF.
--
Rgrds, legion