Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
after which it got allowed in commit 198214a7ee50 ("proc: enable
writing to /proc/pid/mem"). Famous last words from that patch:
"no longer a security hazard". :)
Afterwards exploits started causing drama like [1]. The exploits
using /proc/*/mem can be rather sophisticated like [2] which
installed an arbitrary payload from noexec storage into a running
process then exec'd it, which itself could include an ELF loader
to run arbitrary code off noexec storage.
One of the well-known problems with /proc/*/mem writes is they
ignore page permissions via FOLL_FORCE, as opposed to writes via
process_vm_writev which respect page permissions. These writes can
also be used to bypass mode bits.
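
As a rough illustration of the FOLL_FORCE point, the userspace sketch
below (not part of this patch; the function name is made up for the
example) maps a page without PROT_WRITE and then tries to modify it
through /proc/self/mem. On a kernel that applies FOLL_FORCE to such
writes it is expected to return 1; with the restrictions proposed here
it would presumably return 0 or -1 instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a write through /proc/self/mem modified a PROT_READ page,
 * 0 if the write was refused, -1 if the mapping or the open failed.
 */
int write_readonly_page_via_proc_mem(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int result = -1;
	char *p;
	int fd;

	assert(page > 0);

	/* Private anonymous mapping that is readable but not writable. */
	p = mmap(NULL, (size_t)page, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	fd = open("/proc/self/mem", O_RDWR);
	if (fd >= 0) {
		/* The file offset is the target virtual address. */
		ssize_t n = pwrite(fd, "X", 1, (off_t)(uintptr_t)p);

		result = (n == 1 && *(volatile char *)p == 'X') ? 1 : 0;
		close(fd);
	}

	munmap(p, (size_t)page);
	return result;
}
```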
To harden against these types of attacks, distributions might want
to restrict /proc/pid/mem accesses, either entirely or partially,
e.g. to restrict FOLL_FORCE usage.
Known valid use-cases which still need these accesses are:
* Debuggers which also have ptrace permissions, so they can access
memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
are designed to write /proc/pid/mem for basic functionality.
* Container supervisors using the seccomp notifier to intercept
syscalls and rewrite memory of calling processes by passing
around /proc/pid/mem file descriptors.
There might be more, that's why these params default to disabled.
Regarding other mechanisms which can block these accesses:
* seccomp filters can be used to block mmap/mprotect calls with W|X
perms, but they often can't block open calls as daemons want to
read/write their runtime state and seccomp filters cannot check
file paths, so plain write calls can't be easily blocked.
* Since the mem file is part of the dynamic /proc/<pid>/ space, we
can't run chmod once at boot to restrict it (and trying to react
to every process and run chmod doesn't scale, and the kernel no
longer allows chmod on any of these paths).
* SELinux could be used with a rule to cover all /proc/*/mem files,
but even then having multiple ways to deny an attack is useful in
case one layer fails.
Thus we introduce three kernel parameters to restrict /proc/*/mem
access: read, write and foll_force. All three can be independently
set to the following values:
all => restrict all access unconditionally.
ptracer => restrict all access except for ptracer processes.
If left unset, the existing behaviour is preserved, i.e. access
is governed by basic file permissions.
Examples which can be passed by bootloaders:
restrict_proc_mem_foll_force=all
restrict_proc_mem_write=ptracer
restrict_proc_mem_read=ptracer
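
The intended semantics of the three knobs can be modelled in userspace
roughly as follows. This is only a sketch: the enum, struct and
function names are invented for the example and do not appear in the
patch, which uses static keys instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one knob's value. */
enum restrict_level {
	RESTRICT_NONE,		/* param not set */
	RESTRICT_ALL,		/* "all" */
	RESTRICT_PTRACER,	/* "ptracer" */
};

/* Hypothetical model of the three independent knobs. */
struct proc_mem_policy {
	enum restrict_level read;
	enum restrict_level write;
	enum restrict_level foll_force;
};

/* Returns true if a single knob lets the caller through. */
bool knob_allows(enum restrict_level level, bool caller_is_ptracer)
{
	switch (level) {
	case RESTRICT_NONE:
		return true;
	case RESTRICT_ALL:
		return false;
	case RESTRICT_PTRACER:
		return caller_is_ptracer;
	}

	assert(!"unknown restriction level");
	return false;
}
```

With the example command line above, a ptracer would still be able to
read and write the file, but never with FOLL_FORCE.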
Each distribution needs to decide what restrictions to apply,
depending on its use-cases. Embedded systems might want to do
more, while general-purpose distros might want a more relaxed
policy, because e.g. foll_force=all and write=all both break
GDB, so they might be a bit excessive.
Based on an initial patch by Mike Frysinger <[email protected]>.
Link: https://lwn.net/Articles/476947/ [1]
Link: https://issues.chromium.org/issues/40089045 [2]
Cc: Guenter Roeck <[email protected]>
Cc: Doug Anderson <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Christian Brauner <[email protected]>
Co-developed-by: Mike Frysinger <[email protected]>
Signed-off-by: Mike Frysinger <[email protected]>
Signed-off-by: Adrian Ratiu <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 27 +++++
fs/proc/base.c | 103 +++++++++++++++++-
include/linux/jump_label.h | 5 +
3 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6e62b8cb19c8d..d7f7db41369c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5665,6 +5665,33 @@
reset_devices [KNL] Force drivers to reset the underlying device
during initialization.
+ restrict_proc_mem_read= [KNL]
+ Format: {all | ptracer}
+ Allows restricting read access to /proc/*/mem files.
+ Depending on restriction level, opens for reads return -EACCES.
+ Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+ If not specified, then basic file permissions continue to apply.
+
+ restrict_proc_mem_write= [KNL]
+ Format: {all | ptracer}
+ Allows restricting write access to /proc/*/mem files.
+ Depending on restriction level, opens for writes return -EACCES.
+ Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+ If not specified, then basic file permissions continue to apply.
+
+ restrict_proc_mem_foll_force= [KNL]
+ Format: {all | ptracer}
+ Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
+ If restricted, the FOLL_FORCE flag will not be added to vm accesses.
+ Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+ If not specified, FOLL_FORCE is always used.
+
resume= [SWSUSP]
Specify the partition device for software suspend
Format:
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 18550c071d71c..c733836c42a65 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -152,6 +152,41 @@ struct pid_entry {
NULL, &proc_pid_attr_operations, \
{ .lsmid = LSMID })
+/*
+ * each restrict_proc_mem_* param controls the following static branches:
+ * key[0] = restrict all writes
+ * key[1] = restrict writes except for ptracers
+ * key[2] = restrict all reads
+ * key[3] = restrict reads except for ptracers
+ * key[4] = restrict all FOLL_FORCE usage
+ * key[5] = restrict FOLL_FORCE usage except for ptracers
+ */
+DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
+
+static int __init early_restrict_proc_mem(char *buf, int offset)
+{
+ if (!buf)
+ return -EINVAL;
+
+ if (strncmp(buf, "all", 3) == 0)
+ static_branch_enable(&restrict_proc_mem[offset]);
+ else if (strncmp(buf, "ptracer", 7) == 0)
+ static_branch_enable(&restrict_proc_mem[offset + 1]);
+
+ return 0;
+}
+
+#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
+static int __init early_restrict_proc_mem_##name(char *buf) \
+{ \
+ return early_restrict_proc_mem(buf, offset); \
+} \
+early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
+
+DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
+DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
+DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
+
/*
* Count the number of hardlinks for the pid_entry table, excluding the .
* and .. links.
@@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
return 0;
}
+static bool __mem_open_current_is_ptracer(struct file *file)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+ bool ret = false;
+
+ if (task) {
+ rcu_read_lock();
+ if (current == ptrace_parent(task))
+ ret = true;
+ rcu_read_unlock();
+ put_task_struct(task);
+ }
+
+ return ret;
+}
+
+static int __mem_open_check_access_restriction(struct file *file)
+{
+ if (file->f_mode & FMODE_WRITE) {
+ /* Deny if writes are unconditionally disabled via param */
+ if (static_branch_unlikely(&restrict_proc_mem[0]))
+ return -EACCES;
+
+ /* Deny if writes are allowed only for ptracers via param */
+ if (static_branch_unlikely(&restrict_proc_mem[1]) &&
+ !__mem_open_current_is_ptracer(file))
+ return -EACCES;
+
+ } else if (file->f_mode & FMODE_READ) {
+ /* Deny if reads are unconditionally disabled via param */
+ if (static_branch_unlikely(&restrict_proc_mem[2]))
+ return -EACCES;
+
+ /* Deny if reads are allowed only for ptracers via param */
+ if (static_branch_unlikely(&restrict_proc_mem[3]) &&
+ !__mem_open_current_is_ptracer(file))
+ return -EACCES;
+ }
+
+ return 0;
+}
+
static int mem_open(struct inode *inode, struct file *file)
{
- int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
+ int ret;
+
+ ret = __mem_open_check_access_restriction(file);
+ if (ret)
+ return ret;
+
+ ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
/* OK to pass negative loff_t, we can catch out-of-range */
file->f_mode |= FMODE_UNSIGNED_OFFSET;
@@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
return ret;
}
+static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
+{
+ /* Deny if FOLL_FORCE is disabled via param */
+ if (static_branch_unlikely(&restrict_proc_mem[4]))
+ return 0;
+
+ /* Deny if FOLL_FORCE is allowed only for ptracers via param */
+ if (static_branch_unlikely(&restrict_proc_mem[5]) &&
+ !__mem_open_current_is_ptracer(file))
+ return 0;
+
+ return FOLL_FORCE;
+}
+
static ssize_t mem_rw(struct file *file, char __user *buf,
size_t count, loff_t *ppos, int write)
{
@@ -855,7 +953,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
if (!mmget_not_zero(mm))
goto free;
- flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
+ flags = (write ? FOLL_WRITE : 0);
+ flags |= __mem_rw_get_foll_force_flag(file);
while (count > 0) {
size_t this_len = min_t(size_t, count, PAGE_SIZE);
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..ba2460fe878c5 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -398,6 +398,11 @@ struct static_key_false {
[0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
}
+#define DEFINE_STATIC_KEY_ARRAY_FALSE_RO(name, count) \
+ struct static_key_false name[count] __ro_after_init = { \
+ [0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
+ }
+
#define _DEFINE_STATIC_KEY_1(name) DEFINE_STATIC_KEY_TRUE(name)
#define _DEFINE_STATIC_KEY_0(name) DEFINE_STATIC_KEY_FALSE(name)
#define DEFINE_STATIC_KEY_MAYBE(cfg, name) \
--
2.30.2
Some systems might have difficulty changing their bootloaders
to enable the newly added restrict_proc_mem* params, e.g.
remote embedded doing OTA updates, so this provides a set of
Kconfigs to set /proc/pid/mem restrictions at build-time.
The boot params take precedence over the Kconfig values. This
can be reversed, but doing it this way I think makes sense.
Another idea is to have a global bool Kconfig which can enable
or disable this mechanism in its entirety, however it does not
seem necessary since all three knobs default to off, the branch
logic overhead is rather minimal and I assume most systems
will want to restrict at least the use of FOLL_FORCE.
Cc: Guenter Roeck <[email protected]>
Cc: Doug Anderson <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Adrian Ratiu <[email protected]>
---
fs/proc/base.c | 33 +++++++++++++++++++++++++++++++++
security/Kconfig | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c733836c42a65..e8ee848fc4a98 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -889,6 +889,17 @@ static int __mem_open_check_access_restriction(struct file *file)
!__mem_open_current_is_ptracer(file))
return -EACCES;
+#ifdef CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT
+ /* Deny if writes are unconditionally disabled via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, "all", 3))
+ return -EACCES;
+
+ /* Deny if writes are allowed only for ptracers via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, "ptracer", 7) &&
+ !__mem_open_current_is_ptracer(file))
+ return -EACCES;
+#endif
+
} else if (file->f_mode & FMODE_READ) {
/* Deny if reads are unconditionally disabled via param */
if (static_branch_unlikely(&restrict_proc_mem[2]))
@@ -898,6 +909,17 @@ static int __mem_open_check_access_restriction(struct file *file)
if (static_branch_unlikely(&restrict_proc_mem[3]) &&
!__mem_open_current_is_ptracer(file))
return -EACCES;
+
+#ifdef CONFIG_SECURITY_PROC_MEM_READ_RESTRICT
+ /* Deny if reads are unconditionally disabled via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "all", 3))
+ return -EACCES;
+
+ /* Deny if reads are allowed only for ptracers via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "ptracer", 7) &&
+ !__mem_open_current_is_ptracer(file))
+ return -EACCES;
+#endif
}
return 0;
@@ -930,6 +952,17 @@ static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
!__mem_open_current_is_ptracer(file))
return 0;
+#ifdef CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT
+ /* Deny if FOLL_FORCE is disabled via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "all", 3))
+ return 0;
+
+ /* Deny if FOLL_FORCE is only allowed for ptracers via Kconfig */
+ if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "ptracer", 7) &&
+ !__mem_open_current_is_ptracer(file))
+ return 0;
+#endif
+
return FOLL_FORCE;
}
diff --git a/security/Kconfig b/security/Kconfig
index 412e76f1575d0..31a588cedec8d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -19,6 +19,48 @@ config SECURITY_DMESG_RESTRICT
If you are unsure how to answer this question, answer N.
+config SECURITY_PROC_MEM_READ_RESTRICT
+ string "Restrict read access to /proc/*/mem files"
+ depends on PROC_FS
+ default "none"
+ help
+ This option allows specifying a restriction level for read access
+ to /proc/*/mem files. Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+
+ This can also be set at boot with the "restrict_proc_mem_read=" param.
+
+ If unsure leave empty to continue using basic file permissions.
+
+config SECURITY_PROC_MEM_WRITE_RESTRICT
+ string "Restrict write access to /proc/*/mem files"
+ depends on PROC_FS
+ default "none"
+ help
+ This option allows specifying a restriction level for write access
+ to /proc/*/mem files. Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+
+ This can also be set at boot with the "restrict_proc_mem_write=" param.
+
+ If unsure leave empty to continue using basic file permissions.
+
+config SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT
+ string "Restrict use of FOLL_FORCE for /proc/*/mem access"
+ depends on PROC_FS
+ default ""
+ help
+ This option allows specifying a restriction level for FOLL_FORCE usage
+ for /proc/*/mem access. Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+
+ This can also be set at boot with the "restrict_proc_mem_foll_force=" param.
+
+ If unsure leave empty to continue using FOLL_FORCE without restriction.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
2.30.2
On Tue, Apr 09, 2024 at 08:57:49PM +0300, Adrian Ratiu wrote:
> Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> after which it got allowed in commit 198214a7ee50 ("proc: enable
> writing to /proc/pid/mem"). Famous last words from that patch:
> "no longer a security hazard". :)
>
> Afterwards exploits started causing drama like [1]. The exploits
> using /proc/*/mem can be rather sophisticated like [2] which
> installed an arbitrary payload from noexec storage into a running
> process then exec'd it, which itself could include an ELF loader
> to run arbitrary code off noexec storage.
>
> One of the well-known problems with /proc/*/mem writes is they
> ignore page permissions via FOLL_FORCE, as opposed to writes via
> process_vm_writev which respect page permissions. These writes can
> also be used to bypass mode bits.
>
> To harden against these types of attacks, distributions might want
> to restrict /proc/pid/mem accesses, either entirely or partially,
> e.g. to restrict FOLL_FORCE usage.
>
> Known valid use-cases which still need these accesses are:
>
> * Debuggers which also have ptrace permissions, so they can access
> memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
> are designed to write /proc/pid/mem for basic functionality.
>
> * Container supervisors using the seccomp notifier to intercept
> syscalls and rewrite memory of calling processes by passing
> around /proc/pid/mem file descriptors.
>
> There might be more, that's why these params default to disabled.
>
> Regarding other mechanisms which can block these accesses:
>
> * seccomp filters can be used to block mmap/mprotect calls with W|X
> perms, but they often can't block open calls as daemons want to
> read/write their runtime state and seccomp filters cannot check
> file paths, so plain write calls can't be easily blocked.
>
> * Since the mem file is part of the dynamic /proc/<pid>/ space, we
> can't run chmod once at boot to restrict it (and trying to react
> to every process and run chmod doesn't scale, and the kernel no
> longer allows chmod on any of these paths).
>
> * SELinux could be used with a rule to cover all /proc/*/mem files,
> but even then having multiple ways to deny an attack is useful in
> case one layer fails.
>
> Thus we introduce three kernel parameters to restrict /proc/*/mem
> access: read, write and foll_force. All three can be independently
> set to the following values:
>
> all => restrict all access unconditionally.
> ptracer => restrict all access except for ptracer processes.
>
> If left unset, the existing behaviour is preserved, i.e. access
> is governed by basic file permissions.
>
> Examples which can be passed by bootloaders:
>
> restrict_proc_mem_foll_force=all
> restrict_proc_mem_write=ptracer
> restrict_proc_mem_read=ptracer
>
> Each distribution needs to decide what restrictions to apply,
> depending on its use-cases. Embedded systems might want to do
> more, while general-purpose distros might want a more relaxed
> policy, because e.g. foll_force=all and write=all both break
> GDB, so they might be a bit excessive.
>
> Based on an initial patch by Mike Frysinger <[email protected]>.
Thanks for this new version!
>
> Link: https://lwn.net/Articles/476947/ [1]
> Link: https://issues.chromium.org/issues/40089045 [2]
> Cc: Guenter Roeck <[email protected]>
> Cc: Doug Anderson <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Co-developed-by: Mike Frysinger <[email protected]>
> Signed-off-by: Mike Frysinger <[email protected]>
> Signed-off-by: Adrian Ratiu <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 27 +++++
> fs/proc/base.c | 103 +++++++++++++++++-
> include/linux/jump_label.h | 5 +
> 3 files changed, 133 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 6e62b8cb19c8d..d7f7db41369c7 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5665,6 +5665,33 @@
> reset_devices [KNL] Force drivers to reset the underlying device
> during initialization.
>
> + restrict_proc_mem_read= [KNL]
> + Format: {all | ptracer}
> + Allows restricting read access to /proc/*/mem files.
> + Depending on restriction level, opens for reads return -EACCES.
> + Can be one of:
> + - 'all' restricts all access unconditionally.
> + - 'ptracer' allows access only for ptracer processes.
> + If not specified, then basic file permissions continue to apply.
> +
> + restrict_proc_mem_write= [KNL]
> + Format: {all | ptracer}
> + Allows restricting write access to /proc/*/mem files.
> + Depending on restriction level, opens for writes return -EACCES.
> + Can be one of:
> + - 'all' restricts all access unconditionally.
> + - 'ptracer' allows access only for ptracer processes.
> + If not specified, then basic file permissions continue to apply.
> +
> + restrict_proc_mem_foll_force= [KNL]
> + Format: {all | ptracer}
> + Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
> + If restricted, the FOLL_FORCE flag will not be added to vm accesses.
> + Can be one of:
> + - 'all' restricts all access unconditionally.
> + - 'ptracer' allows access only for ptracer processes.
> + If not specified, FOLL_FORCE is always used.
bike shedding: I wonder if this should be a fake namespace (adding a dot
just to break it up for reading more easily), and have words reordered
to the kernel's more common subject-verb-object: proc_mem.restrict_read=...
> +
> resume= [SWSUSP]
> Specify the partition device for software suspend
> Format:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 18550c071d71c..c733836c42a65 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -152,6 +152,41 @@ struct pid_entry {
> NULL, &proc_pid_attr_operations, \
> { .lsmid = LSMID })
>
> +/*
> + * each restrict_proc_mem_* param controls the following static branches:
> + * key[0] = restrict all writes
> + * key[1] = restrict writes except for ptracers
> + * key[2] = restrict all reads
> + * key[3] = restrict reads except for ptracers
> + * key[4] = restrict all FOLL_FORCE usage
> + * key[5] = restrict FOLL_FORCE usage except for ptracers
> + */
> +DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
So, I don't like having open-coded numbers. And I'm not sure there's a
benefit to stuffing these all into an array? So:
DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_read);
DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_write);
DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_foll_force);
> +
> +static int __init early_restrict_proc_mem(char *buf, int offset)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + if (strncmp(buf, "all", 3) == 0)
I'd use strcmp() to get exact matches. That way "allalksdjflas" doesn't
match. :)
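
To make the difference concrete outside the kernel, here is a small
userspace sketch (the helper names are made up for the example): the
prefix comparison accepts any argument that merely starts with "all",
while the exact comparison does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix match, as in the posted patch: "all<anything>" is accepted. */
bool matches_all_prefix(const char *arg)
{
	assert(arg != NULL);
	return strncmp(arg, "all", 3) == 0;
}

/* Exact match: only the literal string "all" is accepted. */
bool matches_all_exact(const char *arg)
{
	assert(arg != NULL);
	return strcmp(arg, "all") == 0;
}
```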
> + static_branch_enable(&restrict_proc_mem[offset]);
> + else if (strncmp(buf, "ptracer", 7) == 0)
> + static_branch_enable(&restrict_proc_mem[offset + 1]);
> +
> + return 0;
> +}
Then don't bother with a common helper since you've got a macro, and
it'll all get tossed after __init anyway.
> +
> +#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
> +static int __init early_restrict_proc_mem_##name(char *buf) \
> +{ \
> + return early_restrict_proc_mem(buf, offset); \
> +} \
> +early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
> +
> +DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
> +DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
> +DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
#define DEFINE_EARLY_PROC_MEM_RESTRICT(name) \
static int __init early_proc_mem_restrict_##name(char *buf) \
{ \
if (!buf) \
return -EINVAL; \
\
if (strcmp(buf, "all") == 0) \
static_branch_enable(&proc_mem_restrict_##name); \
else if (strcmp(buf, "ptracer") == 0) \
static_branch_enable(&proc_mem_restrict_##name); \
\
return 0; \
} \
early_param("proc_mem_restrict_" #name, early_proc_mem_restrict_##name)
> +
> /*
> * Count the number of hardlinks for the pid_entry table, excluding the .
> * and .. links.
> @@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
> return 0;
> }
>
> +static bool __mem_open_current_is_ptracer(struct file *file)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + int ret = false;
> +
> + if (task) {
> + rcu_read_lock();
> + if (current == ptrace_parent(task))
> + ret = true;
> + rcu_read_unlock();
> + put_task_struct(task);
> + }
This creates a ToCToU race between this check (which releases the task)
and the later memopen which may get a different task (and mm).
To deal with this, I think you need to add a new mode flag for
proc_mem_open(), and add the checking there.
> +
> + return ret;
> +}
> +
> +static int __mem_open_check_access_restriction(struct file *file)
> +{
> + if (file->f_mode & FMODE_WRITE) {
> + /* Deny if writes are unconditionally disabled via param */
> + if (static_branch_unlikely(&restrict_proc_mem[0]))
> + return -EACCES;
> +
> + /* Deny if writes are allowed only for ptracers via param */
> + if (static_branch_unlikely(&restrict_proc_mem[1]) &&
> + !__mem_open_current_is_ptracer(file))
> + return -EACCES;
> +
> + } else if (file->f_mode & FMODE_READ) {
I think this "else" means that O_RDWR opens will only check the write
flag, so drop the "else".
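
A userspace model of the two shapes shows the difference for a
read-write open. The flag and function names below are stand-ins made
up for the example, not the kernel's FMODE_* definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define MODE_READ  0x1u	/* stand-in for FMODE_READ */
#define MODE_WRITE 0x2u	/* stand-in for FMODE_WRITE */

/* Shape of the posted check: the read branch is skipped for writers. */
bool denied_with_else(unsigned int mode, bool deny_read, bool deny_write)
{
	/* Only the two model bits are meaningful here. */
	assert((mode & ~(MODE_READ | MODE_WRITE)) == 0);

	if (mode & MODE_WRITE) {
		if (deny_write)
			return true;
	} else if (mode & MODE_READ) {
		if (deny_read)
			return true;
	}
	return false;
}

/* Shape without the "else": both restrictions apply to O_RDWR. */
bool denied_without_else(unsigned int mode, bool deny_read, bool deny_write)
{
	assert((mode & ~(MODE_READ | MODE_WRITE)) == 0);

	if ((mode & MODE_WRITE) && deny_write)
		return true;
	if ((mode & MODE_READ) && deny_read)
		return true;
	return false;
}
```

With reads restricted and writes unrestricted, a read-write open passes
the first variant and is denied by the second.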
> + /* Deny if reads are unconditionally disabled via param */
> + if (static_branch_unlikely(&restrict_proc_mem[2]))
> + return -EACCES;
> +
> + /* Deny if reads are allowed only for ptracers via param */
> + if (static_branch_unlikely(&restrict_proc_mem[3]) &&
> + !__mem_open_current_is_ptracer(file))
> + return -EACCES;
> + }
> +
> + return 0;
> +}
> +
> static int mem_open(struct inode *inode, struct file *file)
> {
> - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> + int ret;
> +
> + ret = __mem_open_check_access_restriction(file);
> + if (ret)
> + return ret;
> +
> + ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
>
> /* OK to pass negative loff_t, we can catch out-of-range */
> file->f_mode |= FMODE_UNSIGNED_OFFSET;
> @@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
> return ret;
> }
>
> +static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
> +{
> + /* Deny if FOLL_FORCE is disabled via param */
> + if (static_branch_unlikely(&restrict_proc_mem[4]))
> + return 0;
> +
> + /* Deny if FOLL_FORCE is allowed only for ptracers via param */
> + if (static_branch_unlikely(&restrict_proc_mem[5]) &&
> + !__mem_open_current_is_ptracer(file))
This is like the ToCToU: the task may have changed out from under us
between the open and the read/write.
I'm not sure how to store this during "open" though... Hmmm
> + return 0;
> +
> + return FOLL_FORCE;
> +}
> +
> static ssize_t mem_rw(struct file *file, char __user *buf,
> size_t count, loff_t *ppos, int write)
> {
> @@ -855,7 +953,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
> if (!mmget_not_zero(mm))
> goto free;
>
> - flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> + flags = (write ? FOLL_WRITE : 0);
> + flags |= __mem_rw_get_foll_force_flag(file);
I wonder if we need some way to track openers in the mm? That sounds
not-fun.
>
> while (count > 0) {
> size_t this_len = min_t(size_t, count, PAGE_SIZE);
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> index f5a2727ca4a9a..ba2460fe878c5 100644
> --- a/include/linux/jump_label.h
> +++ b/include/linux/jump_label.h
> @@ -398,6 +398,11 @@ struct static_key_false {
> [0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
> }
>
> +#define DEFINE_STATIC_KEY_ARRAY_FALSE_RO(name, count) \
> + struct static_key_false name[count] __ro_after_init = { \
> + [0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
> + }
Let's not add this. :)
> +
> #define _DEFINE_STATIC_KEY_1(name) DEFINE_STATIC_KEY_TRUE(name)
> #define _DEFINE_STATIC_KEY_0(name) DEFINE_STATIC_KEY_FALSE(name)
> #define DEFINE_STATIC_KEY_MAYBE(cfg, name) \
So, yes, conceptually, I really like this -- we've got some good
granularity now, and wow do I love being able to turn off FOLL_FORCE. :)
Safely checking for ptracer is tricky, though. I wonder how we could
store the foll_force state in the private_data somehow. Seems a bit
painful to allocate a struct for it. We could do some really horrid
hacks like store it in the low bit of the mm address that gets stored to
private_data and mask it out when used, but that's really ugly too...
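
For what it's worth, the low-bit idea would look roughly like this in a
userspace model. struct mm_model and the helper names are invented for
the illustration, and the trick only works because the pointer is at
least 2-byte aligned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mm_model { long dummy; };	/* stand-in for struct mm_struct */

#define FORCE_BIT ((uintptr_t)1)

/* Fold the "FOLL_FORCE allowed" decision into the stored pointer. */
void *pack_private(struct mm_model *mm, bool allow_force)
{
	uintptr_t v = (uintptr_t)mm;

	/* The low bit must be free, i.e. the object is aligned. */
	assert((v & FORCE_BIT) == 0);
	return (void *)(v | (allow_force ? FORCE_BIT : 0));
}

/* Mask the flag out before using the value as a pointer again. */
struct mm_model *unpack_mm(void *priv)
{
	return (struct mm_model *)((uintptr_t)priv & ~FORCE_BIT);
}

/* Recover the decision that was made at open time. */
bool unpack_force(void *priv)
{
	return ((uintptr_t)priv & FORCE_BIT) != 0;
}
```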
-Kees
--
Kees Cook
On Tue, Apr 09, 2024 at 08:57:50PM +0300, Adrian Ratiu wrote:
> Some systems might have difficulty changing their bootloaders
> to enable the newly added restrict_proc_mem* params, e.g.
> remote embedded doing OTA updates, so this provides a set of
> Kconfigs to set /proc/pid/mem restrictions at build-time.
>
> The boot params take precedence over the Kconfig values. This
> can be reversed, but doing it this way I think makes sense.
>
> Another idea is to have a global bool Kconfig which can enable
> or disable this mechanism in its entirety, however it does not
> seem necessary since all three knobs default to off, the branch
> logic overhead is rather minimal and I assume most systems
> will want to restrict at least the use of FOLL_FORCE.
>
> Cc: Guenter Roeck <[email protected]>
> Cc: Doug Anderson <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Signed-off-by: Adrian Ratiu <[email protected]>
> ---
> fs/proc/base.c | 33 +++++++++++++++++++++++++++++++++
> security/Kconfig | 42 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 75 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index c733836c42a65..e8ee848fc4a98 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -889,6 +889,17 @@ static int __mem_open_check_access_restriction(struct file *file)
> !__mem_open_current_is_ptracer(file))
> return -EACCES;
>
> +#ifdef CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT
No, please. :)
Just use the _MAYBE/_maybe variants of the static branch DECLAREs and
branches, and make Kconfigs for:
CONFIG_PROC_MEM_RESTRICT_READ_DEFAULT
CONFIG_PROC_MEM_RESTRICT_WRITE_DEFAULT
CONFIG_PROC_MEM_RESTRICT_FOLL_FORCE_DEFAULT
Like:
DECLARE_STATIC_KEY_MAYBE(CONFIG_PROC_MEM_RESTRICT_READ_DEFAULT, proc_mem_restrict_read);
and then later:
if (static_branch_maybe(CONFIG_PROC_MEM_RESTRICT_READ_DEFAULT,
&proc_mem_restrict_read))
...
Then all builds of the kernel will have it available, but system
builders who want it enabled by default will get a slightly more
optimized "if".
-Kees
--
Kees Cook
On Fri, Apr 26, 2024 at 04:10:49PM -0700, Kees Cook wrote:
> On Tue, Apr 09, 2024 at 08:57:49PM +0300, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> >
> > Afterwards exploits started causing drama like [1]. The exploits
> > using /proc/*/mem can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> >
> > One of the well-known problems with /proc/*/mem writes is they
> > ignore page permissions via FOLL_FORCE, as opposed to writes via
> > process_vm_writev which respect page permissions. These writes can
> > also be used to bypass mode bits.
> >
> > To harden against these types of attacks, distributions might want
> > to restrict /proc/pid/mem accesses, either entirely or partially,
> > e.g. to restrict FOLL_FORCE usage.
> >
> > Known valid use-cases which still need these accesses are:
> >
> > * Debuggers which also have ptrace permissions, so they can access
> > memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
> > are designed to write /proc/pid/mem for basic functionality.
> >
> > * Container supervisors using the seccomp notifier to intercept
> > syscalls and rewrite memory of calling processes by passing
> > around /proc/pid/mem file descriptors.
> >
> > There might be more, that's why these params default to disabled.
> >
> > Regarding other mechanisms which can block these accesses:
> >
> > * seccomp filters can be used to block mmap/mprotect calls with W|X
> > perms, but they often can't block open calls as daemons want to
> > read/write their runtime state and seccomp filters cannot check
> > file paths, so plain write calls can't be easily blocked.
> >
> > * Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> >
> > * SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case one layer fails.
> >
> > Thus we introduce three kernel parameters to restrict /proc/*/mem
> > access: read, write and foll_force. All three can be independently
> > set to the following values:
> >
> > all => restrict all access unconditionally.
> > ptracer => restrict all access except for ptracer processes.
> >
> > If left unset, the existing behaviour is preserved, i.e. access
> > is governed by basic file permissions.
> >
> > Examples which can be passed by bootloaders:
> >
> > restrict_proc_mem_foll_force=all
> > restrict_proc_mem_write=ptracer
> > restrict_proc_mem_read=ptracer
> >
> > Each distribution needs to decide what restrictions to apply,
> > depending on its use-cases. Embedded systems might want to do
> > more, while general-purpose distros might want a more relaxed
> > policy, because e.g. foll_force=all and write=all both break
> > GDB, so it might be a bit excessive.
> >
> > Based on an initial patch by Mike Frysinger <[email protected]>.
>
> Thanks for this new version!
>
> >
> > Link: https://lwn.net/Articles/476947/ [1]
> > Link: https://issues.chromium.org/issues/40089045 [2]
> > Cc: Guenter Roeck <[email protected]>
> > Cc: Doug Anderson <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: Jann Horn <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Randy Dunlap <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Co-developed-by: Mike Frysinger <[email protected]>
> > Signed-off-by: Mike Frysinger <[email protected]>
> > Signed-off-by: Adrian Ratiu <[email protected]>
> > ---
> > .../admin-guide/kernel-parameters.txt | 27 +++++
> > fs/proc/base.c | 103 +++++++++++++++++-
> > include/linux/jump_label.h | 5 +
> > 3 files changed, 133 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 6e62b8cb19c8d..d7f7db41369c7 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5665,6 +5665,33 @@
> > reset_devices [KNL] Force drivers to reset the underlying device
> > during initialization.
> >
> > + restrict_proc_mem_read= [KNL]
> > + Format: {all | ptracer}
> > + Allows restricting read access to /proc/*/mem files.
> > + Depending on restriction level, open for reads return -EACCES.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, then basic file permissions continue to apply.
> > +
> > + restrict_proc_mem_write= [KNL]
> > + Format: {all | ptracer}
> > + Allows restricting write access to /proc/*/mem files.
> > + Depending on restriction level, open for writes return -EACCES.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, then basic file permissions continue to apply.
> > +
> > + restrict_proc_mem_foll_force= [KNL]
> > + Format: {all | ptracer}
> > + Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
> > + If restricted, the FOLL_FORCE flag will not be added to vm accesses.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, FOLL_FORCE is always used.
>
> bike shedding: I wonder if this should be a fake namespace (adding a dot
> just to break it up for reading more easily), and have words reordered
> to the kernel's more common subject-verb-object: proc_mem.restrict_read=...
>
> > +
> > resume= [SWSUSP]
> > Specify the partition device for software suspend
> > Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 18550c071d71c..c733836c42a65 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,41 @@ struct pid_entry {
> > NULL, &proc_pid_attr_operations, \
> > { .lsmid = LSMID })
> >
> > +/*
> > + * each restrict_proc_mem_* param controls the following static branches:
> > + * key[0] = restrict all writes
> > + * key[1] = restrict writes except for ptracers
> > + * key[2] = restrict all reads
> > + * key[3] = restrict reads except for ptracers
> > + * key[4] = restrict all FOLL_FORCE usage
> > + * key[5] = restrict FOLL_FORCE usage except for ptracers
> > + */
> > +DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
>
> So, I don't like having open-coded numbers. And I'm not sure there's a
> benefit to stuffing these all into an array? So:
>
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_read);
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_write);
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_foll_force);
>
> > +
> > +static int __init early_restrict_proc_mem(char *buf, int offset)
> > +{
> > + if (!buf)
> > + return -EINVAL;
> > +
> > + if (strncmp(buf, "all", 3) == 0)
>
> I'd use strcmp() to get exact matches. That way "allalksdjflas" doesn't
> match. :)
>
> > + static_branch_enable(&restrict_proc_mem[offset]);
> > + else if (strncmp(buf, "ptracer", 7) == 0)
> > + static_branch_enable(&restrict_proc_mem[offset + 1]);
> > +
> > + return 0;
> > +}
>
> Then don't bother with a common helper since you've got a macro, and
> it'll all get tossed after __init anyway.
>
> > +
> > +#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
> > +static int __init early_restrict_proc_mem_##name(char *buf) \
> > +{ \
> > + return early_restrict_proc_mem(buf, offset); \
> > +} \
> > +early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
> > +
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
>
> #define DEFINE_EARLY_PROC_MEM_RESTRICT(name) \
> static int __init early_proc_mem_restrict_##name(char *buf) \
> { \
> if (!buf) \
> return -EINVAL; \
> \
> if (strcmp(buf, "all") == 0) \
> static_branch_enable(&proc_mem_restrict_##name); \
> else if (strcmp(buf, "ptracer") == 0) \
> static_branch_enable(&proc_mem_restrict_##name); \
> \
> return 0; \
> } \
> early_param("proc_mem_restrict_" #name, early_proc_mem_restrict_##name)
>
>
> > +
> > /*
> > * Count the number of hardlinks for the pid_entry table, excluding the .
> > * and .. links.
> > @@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
> > return 0;
> > }
> >
> > +static bool __mem_open_current_is_ptracer(struct file *file)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct task_struct *task = get_proc_task(inode);
> > + int ret = false;
> > +
> > + if (task) {
> > + rcu_read_lock();
> > + if (current == ptrace_parent(task))
> > + ret = true;
> > + rcu_read_unlock();
> > + put_task_struct(task);
> > + }
>
> This creates a ToCToU race between this check (which releases the task)
> and the later memopen which may get a different task (and mm).
>
> To deal with this, I think you need to add a new mode flag for
> proc_mem_open(), and add the checking there.
>
> > +
> > + return ret;
> > +}
> > +
> > +static int __mem_open_check_access_restriction(struct file *file)
> > +{
> > + if (file->f_mode & FMODE_WRITE) {
> > + /* Deny if writes are unconditionally disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[0]))
> > + return -EACCES;
> > +
> > + /* Deny if writes are allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[1]) &&
> > + !__mem_open_current_is_ptracer(file))
> > + return -EACCES;
> > +
> > + } else if (file->f_mode & FMODE_READ) {
>
> I think this "else" means that O_RDWR opens will only check the write
> flag, so drop the "else".
>
> > + /* Deny if reads are unconditionally disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[2]))
> > + return -EACCES;
> > +
> > + /* Deny if reads are allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[3]) &&
> > + !__mem_open_current_is_ptracer(file))
> > + return -EACCES;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static int mem_open(struct inode *inode, struct file *file)
> > {
> > - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > + int ret;
> > +
> > + ret = __mem_open_check_access_restriction(file);
> > + if (ret)
> > + return ret;
> > +
> > + ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >
> > /* OK to pass negative loff_t, we can catch out-of-range */
> > file->f_mode |= FMODE_UNSIGNED_OFFSET;
> > @@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
> > return ret;
> > }
> >
> > +static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
> > +{
> > + /* Deny if FOLL_FORCE is disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[4]))
> > + return 0;
> > +
> > + /* Deny if FOLL_FORCE is allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[5]) &&
> > + !__mem_open_current_is_ptracer(file))
>
> This is like the ToCToU: the task may have changed out from under us
> between the open and the read/write.

But why would you care? As long as the task is the ptracer it doesn't
really matter afaict.

On Fri, May 03, 2024 at 11:57:56AM +0200, Christian Brauner wrote:
> On Fri, Apr 26, 2024 at 04:10:49PM -0700, Kees Cook wrote:
> > On Tue, Apr 09, 2024 at 08:57:49PM +0300, Adrian Ratiu wrote:
> > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > > writing to /proc/pid/mem"). Famous last words from that patch:
> > > "no longer a security hazard". :)
> > >
> > > Afterwards exploits started causing drama like [1]. The exploits
> > > using /proc/*/mem can be rather sophisticated like [2] which
> > > installed an arbitrary payload from noexec storage into a running
> > > process then exec'd it, which itself could include an ELF loader
> > > to run arbitrary code off noexec storage.
> > >
> > > One of the well-known problems with /proc/*/mem writes is they
> > > ignore page permissions via FOLL_FORCE, as opposed to writes via
> > > process_vm_writev which respect page permissions. These writes can
> > > also be used to bypass mode bits.
> > >
> > > To harden against these types of attacks, distributions might want
> > > to restrict /proc/pid/mem accesses, either entirely or partially,
> > > e.g. to restrict FOLL_FORCE usage.
> > >
> > > Known valid use-cases which still need these accesses are:
> > >
> > > * Debuggers which also have ptrace permissions, so they can access
> > > memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
> > > are designed to write /proc/pid/mem for basic functionality.
> > >
> > > * Container supervisors using the seccomp notifier to intercept
> > > syscalls and rewrite memory of calling processes by passing
> > > around /proc/pid/mem file descriptors.
> > >
> > > There might be more, that's why these params default to disabled.
> > >
> > > Regarding other mechanisms which can block these accesses:
> > >
> > > * seccomp filters can be used to block mmap/mprotect calls with W|X
> > > perms, but they often can't block open calls as daemons want to
> > > read/write their runtime state and seccomp filters cannot check
> > > file paths, so plain write calls can't be easily blocked.
> > >
> > > * Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > > can't run chmod once at boot to restrict it (and trying to react
> > > to every process and run chmod doesn't scale, and the kernel no
> > > longer allows chmod on any of these paths).
> > >
> > > * SELinux could be used with a rule to cover all /proc/*/mem files,
> > > but even then having multiple ways to deny an attack is useful in
> > > case one layer fails.
> > >
> > > Thus we introduce three kernel parameters to restrict /proc/*/mem
> > > access: read, write and foll_force. All three can be independently
> > > set to the following values:
> > >
> > > all => restrict all access unconditionally.
> > > ptracer => restrict all access except for ptracer processes.
> > >
> > > If left unset, the existing behaviour is preserved, i.e. access
> > > is governed by basic file permissions.
> > >
> > > Examples which can be passed by bootloaders:
> > >
> > > restrict_proc_mem_foll_force=all
> > > restrict_proc_mem_write=ptracer
> > > restrict_proc_mem_read=ptracer
> > >
> > > Each distribution needs to decide what restrictions to apply,
> > > depending on its use-cases. Embedded systems might want to do
> > > more, while general-purpose distros might want a more relaxed
> > > policy, because e.g. foll_force=all and write=all both break
> > > GDB, so it might be a bit excessive.
> > >
> > > Based on an initial patch by Mike Frysinger <[email protected]>.
> >
> > Thanks for this new version!
> >
> > >
> > > Link: https://lwn.net/Articles/476947/ [1]
> > > Link: https://issues.chromium.org/issues/40089045 [2]
> > > Cc: Guenter Roeck <[email protected]>
> > > Cc: Doug Anderson <[email protected]>
> > > Cc: Kees Cook <[email protected]>
> > > Cc: Jann Horn <[email protected]>
> > > Cc: Andrew Morton <[email protected]>
> > > Cc: Randy Dunlap <[email protected]>
> > > Cc: Christian Brauner <[email protected]>
> > > Co-developed-by: Mike Frysinger <[email protected]>
> > > Signed-off-by: Mike Frysinger <[email protected]>
> > > Signed-off-by: Adrian Ratiu <[email protected]>
> > > ---
> > > .../admin-guide/kernel-parameters.txt | 27 +++++
> > > fs/proc/base.c | 103 +++++++++++++++++-
> > > include/linux/jump_label.h | 5 +
> > > 3 files changed, 133 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 6e62b8cb19c8d..d7f7db41369c7 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -5665,6 +5665,33 @@
> > > reset_devices [KNL] Force drivers to reset the underlying device
> > > during initialization.
> > >
> > > + restrict_proc_mem_read= [KNL]
> > > + Format: {all | ptracer}
> > > + Allows restricting read access to /proc/*/mem files.
> > > + Depending on restriction level, open for reads return -EACCES.
> > > + Can be one of:
> > > + - 'all' restricts all access unconditionally.
> > > + - 'ptracer' allows access only for ptracer processes.
> > > + If not specified, then basic file permissions continue to apply.
> > > +
> > > + restrict_proc_mem_write= [KNL]
> > > + Format: {all | ptracer}
> > > + Allows restricting write access to /proc/*/mem files.
> > > + Depending on restriction level, open for writes return -EACCES.
> > > + Can be one of:
> > > + - 'all' restricts all access unconditionally.
> > > + - 'ptracer' allows access only for ptracer processes.
> > > + If not specified, then basic file permissions continue to apply.
> > > +
> > > + restrict_proc_mem_foll_force= [KNL]
> > > + Format: {all | ptracer}
> > > + Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
> > > + If restricted, the FOLL_FORCE flag will not be added to vm accesses.
> > > + Can be one of:
> > > + - 'all' restricts all access unconditionally.
> > > + - 'ptracer' allows access only for ptracer processes.
> > > + If not specified, FOLL_FORCE is always used.
> >
> > bike shedding: I wonder if this should be a fake namespace (adding a dot
> > just to break it up for reading more easily), and have words reordered
> > to the kernel's more common subject-verb-object: proc_mem.restrict_read=...
> >
> > > +
> > > resume= [SWSUSP]
> > > Specify the partition device for software suspend
> > > Format:
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 18550c071d71c..c733836c42a65 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -152,6 +152,41 @@ struct pid_entry {
> > > NULL, &proc_pid_attr_operations, \
> > > { .lsmid = LSMID })
> > >
> > > +/*
> > > + * each restrict_proc_mem_* param controls the following static branches:
> > > + * key[0] = restrict all writes
> > > + * key[1] = restrict writes except for ptracers
> > > + * key[2] = restrict all reads
> > > + * key[3] = restrict reads except for ptracers
> > > + * key[4] = restrict all FOLL_FORCE usage
> > > + * key[5] = restrict FOLL_FORCE usage except for ptracers
> > > + */
> > > +DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
> >
> > So, I don't like having open-coded numbers. And I'm not sure there's a
> > benefit to stuffing these all into an array? So:
> >
> > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_read);
> > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_write);
> > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_foll_force);
> >
> > > +
> > > +static int __init early_restrict_proc_mem(char *buf, int offset)
> > > +{
> > > + if (!buf)
> > > + return -EINVAL;
> > > +
> > > + if (strncmp(buf, "all", 3) == 0)
> >
> > I'd use strcmp() to get exact matches. That way "allalksdjflas" doesn't
> > match. :)
> >
> > > + static_branch_enable(&restrict_proc_mem[offset]);
> > > + else if (strncmp(buf, "ptracer", 7) == 0)
> > > + static_branch_enable(&restrict_proc_mem[offset + 1]);
> > > +
> > > + return 0;
> > > +}
> >
> > Then don't bother with a common helper since you've got a macro, and
> > it'll all get tossed after __init anyway.
> >
> > > +
> > > +#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
> > > +static int __init early_restrict_proc_mem_##name(char *buf) \
> > > +{ \
> > > + return early_restrict_proc_mem(buf, offset); \
> > > +} \
> > > +early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
> > > +
> > > +DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
> > > +DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
> > > +DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
> >
> > #define DEFINE_EARLY_PROC_MEM_RESTRICT(name) \
> > static int __init early_proc_mem_restrict_##name(char *buf) \
> > { \
> > if (!buf) \
> > return -EINVAL; \
> > \
> > if (strcmp(buf, "all") == 0) \
> > static_branch_enable(&proc_mem_restrict_##name); \
> > else if (strcmp(buf, "ptracer") == 0) \
> > static_branch_enable(&proc_mem_restrict_##name); \
> > \
> > return 0; \
> > } \
> > early_param("proc_mem_restrict_" #name, early_proc_mem_restrict_##name)
> >
> >
> > > +
> > > /*
> > > * Count the number of hardlinks for the pid_entry table, excluding the .
> > > * and .. links.
> > > @@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
> > > return 0;
> > > }
> > >
> > > +static bool __mem_open_current_is_ptracer(struct file *file)
> > > +{
> > > + struct inode *inode = file_inode(file);
> > > + struct task_struct *task = get_proc_task(inode);
> > > + int ret = false;
> > > +
> > > + if (task) {
> > > + rcu_read_lock();
> > > + if (current == ptrace_parent(task))
> > > + ret = true;
> > > + rcu_read_unlock();
> > > + put_task_struct(task);
> > > + }
> >
> > This creates a ToCToU race between this check (which releases the task)
> > and the later memopen which may get a different task (and mm).
> >
> > To deal with this, I think you need to add a new mode flag for
> > proc_mem_open(), and add the checking there.
> >
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static int __mem_open_check_access_restriction(struct file *file)
> > > +{
> > > + if (file->f_mode & FMODE_WRITE) {
> > > + /* Deny if writes are unconditionally disabled via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[0]))
> > > + return -EACCES;
> > > +
> > > + /* Deny if writes are allowed only for ptracers via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[1]) &&
> > > + !__mem_open_current_is_ptracer(file))
> > > + return -EACCES;
> > > +
> > > + } else if (file->f_mode & FMODE_READ) {
> >
> > I think this "else" means that O_RDWR opens will only check the write
> > flag, so drop the "else".
> >
> > > + /* Deny if reads are unconditionally disabled via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[2]))
> > > + return -EACCES;
> > > +
> > > + /* Deny if reads are allowed only for ptracers via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[3]) &&
> > > + !__mem_open_current_is_ptracer(file))
> > > + return -EACCES;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > static int mem_open(struct inode *inode, struct file *file)
> > > {
> > > - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > > + int ret;
> > > +
> > > + ret = __mem_open_check_access_restriction(file);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > >
> > > /* OK to pass negative loff_t, we can catch out-of-range */
> > > file->f_mode |= FMODE_UNSIGNED_OFFSET;
> > > @@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
> > > return ret;
> > > }
> > >
> > > +static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
> > > +{
> > > + /* Deny if FOLL_FORCE is disabled via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[4]))
> > > + return 0;
> > > +
> > > + /* Deny if FOLL_FORCE is allowed only for ptracers via param */
> > > + if (static_branch_unlikely(&restrict_proc_mem[5]) &&
> > > + !__mem_open_current_is_ptracer(file))
> >
> > This is like the ToCToU: the task may have changed out from under us
> > between the open and the read/write.
>
> But why would you care? As long as the task is the ptracer it doesn't
> really matter afaict.

Because the mm you're writing to may no longer be associated with the
task.

proc_mem_operations.open() will take a reference to the target task's
mm, via proc_mem_open() through __mem_open():

	struct task_struct *task = get_proc_task(inode);
	...
	mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
	...
	file->private_data = mm;

And in the proposed check added to mem_rw(), if get_proc_task(inode)
returns a different task (i.e. the pid got recycled and the original mm
is still associated with a forked task), then it could write to the
forked task using the ptrace check against the new task.

Looking at it again now, I think it should be possible to just revalidate
the mm in __mem_open_current_is_ptracer(), though. I.e. it would be
allowed if the ptrace check passes and file->private_data == mm_access(...),
for the mem_rw case...
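
A rough userspace model of that revalidation, assuming simplified
stand-in structs. Everything named model_* below is hypothetical and is
not kernel API; it only illustrates why comparing the mm pinned at open
time against the mm of whatever task the pid resolves to now closes the
pid-recycling window:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; none of this is kernel API. */
struct model_mm {
	int id;
};

struct model_task {
	struct model_mm *mm;              /* address space the task runs in now */
	const struct model_task *ptracer; /* task tracing it, or NULL */
};

struct model_file {
	struct model_mm *private_data;    /* mm pinned at open() time */
};

/*
 * Model of a read/write-time check. 'task' is whatever the pid resolves
 * to at this moment, which may differ from the task seen at open().
 */
static bool model_rw_allowed(const struct model_file *file,
			     const struct model_task *task,
			     const struct model_task *caller)
{
	/* The ptracer check alone is what the v3 patch models. */
	if (!task || task->ptracer != caller)
		return false;

	/* Revalidate: the mm being accessed must still belong to this task. */
	return file->private_data == task->mm;
}
```

With only the ptracer comparison, a recycled pid whose new task happens
to be traced by the caller would pass; the extra pointer comparison
rejects it because the pinned mm no longer matches.
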
--
Kees Cook
On Saturday, April 27, 2024 02:10 EEST, Kees Cook <[email protected]> wrote:
> On Tue, Apr 09, 2024 at 08:57:49PM +0300, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> >
> > Afterwards exploits started causing drama like [1]. The exploits
> > using /proc/*/mem can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> >
> > One of the well-known problems with /proc/*/mem writes is they
> > ignore page permissions via FOLL_FORCE, as opposed to writes via
> > process_vm_writev which respect page permissions. These writes can
> > also be used to bypass mode bits.
> >
> > To harden against these types of attacks, distributions might want
> > to restrict /proc/pid/mem accesses, either entirely or partially,
> > e.g. to restrict FOLL_FORCE usage.
> >
> > Known valid use-cases which still need these accesses are:
> >
> > * Debuggers which also have ptrace permissions, so they can access
> > memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
> > are designed to write /proc/pid/mem for basic functionality.
> >
> > * Container supervisors using the seccomp notifier to intercept
> > syscalls and rewrite memory of calling processes by passing
> > around /proc/pid/mem file descriptors.
> >
> > There might be more, that's why these params default to disabled.
> >
> > Regarding other mechanisms which can block these accesses:
> >
> > * seccomp filters can be used to block mmap/mprotect calls with W|X
> > perms, but they often can't block open calls as daemons want to
> > read/write their runtime state and seccomp filters cannot check
> > file paths, so plain write calls can't be easily blocked.
> >
> > * Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> >
> > * SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case one layer fails.
> >
> > Thus we introduce three kernel parameters to restrict /proc/*/mem
> > access: read, write and foll_force. All three can be independently
> > set to the following values:
> >
> > all => restrict all access unconditionally.
> > ptracer => restrict all access except for ptracer processes.
> >
> > If left unset, the existing behaviour is preserved, i.e. access
> > is governed by basic file permissions.
> >
> > Examples which can be passed by bootloaders:
> >
> > restrict_proc_mem_foll_force=all
> > restrict_proc_mem_write=ptracer
> > restrict_proc_mem_read=ptracer
> >
> > Each distribution needs to decide what restrictions to apply,
> > depending on its use-cases. Embedded systems might want to do
> > more, while general-purpose distros might want a more relaxed
> > policy, because e.g. foll_force=all and write=all both break
> > GDB, so it might be a bit excessive.
> >
> > Based on an initial patch by Mike Frysinger <[email protected]>.
>
> Thanks for this new version!

Thank you for the great feedback and sorry for the delayed response.
I had to go offline for 2 weeks during the Eastern Easter period.

I'll implement all your suggestions and then send a v4.

>
> >
> > Link: https://lwn.net/Articles/476947/ [1]
> > Link: https://issues.chromium.org/issues/40089045 [2]
> > Cc: Guenter Roeck <[email protected]>
> > Cc: Doug Anderson <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: Jann Horn <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Randy Dunlap <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Co-developed-by: Mike Frysinger <[email protected]>
> > Signed-off-by: Mike Frysinger <[email protected]>
> > Signed-off-by: Adrian Ratiu <[email protected]>
> > ---
> > .../admin-guide/kernel-parameters.txt | 27 +++++
> > fs/proc/base.c | 103 +++++++++++++++++-
> > include/linux/jump_label.h | 5 +
> > 3 files changed, 133 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 6e62b8cb19c8d..d7f7db41369c7 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5665,6 +5665,33 @@
> > reset_devices [KNL] Force drivers to reset the underlying device
> > during initialization.
> >
> > + restrict_proc_mem_read= [KNL]
> > + Format: {all | ptracer}
> > + Allows restricting read access to /proc/*/mem files.
> > + Depending on restriction level, open for reads return -EACCES.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, then basic file permissions continue to apply.
> > +
> > + restrict_proc_mem_write= [KNL]
> > + Format: {all | ptracer}
> > + Allows restricting write access to /proc/*/mem files.
> > + Depending on restriction level, open for writes return -EACCES.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, then basic file permissions continue to apply.
> > +
> > + restrict_proc_mem_foll_force= [KNL]
> > + Format: {all | ptracer}
> > + Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
> > + If restricted, the FOLL_FORCE flag will not be added to vm accesses.
> > + Can be one of:
> > + - 'all' restricts all access unconditionally.
> > + - 'ptracer' allows access only for ptracer processes.
> > + If not specified, FOLL_FORCE is always used.
>
> bike shedding: I wonder if this should be a fake namespace (adding a dot
> just to break it up for reading more easily), and have words reordered
> to the kernel's more common subject-verb-object: proc_mem.restrict_read=...
>
> > +
> > resume= [SWSUSP]
> > Specify the partition device for software suspend
> > Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 18550c071d71c..c733836c42a65 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,41 @@ struct pid_entry {
> > NULL, &proc_pid_attr_operations, \
> > { .lsmid = LSMID })
> >
> > +/*
> > + * each restrict_proc_mem_* param controls the following static branches:
> > + * key[0] = restrict all writes
> > + * key[1] = restrict writes except for ptracers
> > + * key[2] = restrict all reads
> > + * key[3] = restrict reads except for ptracers
> > + * key[4] = restrict all FOLL_FORCE usage
> > + * key[5] = restrict FOLL_FORCE usage except for ptracers
> > + */
> > +DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
>
> So, I don't like having open-coded numbers. And I'm not sure there's a
> benefit to stuffing these all into an array? So:
>
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_read);
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_write);
> DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_foll_force);
>
> > +
> > +static int __init early_restrict_proc_mem(char *buf, int offset)
> > +{
> > + if (!buf)
> > + return -EINVAL;
> > +
> > + if (strncmp(buf, "all", 3) == 0)
>
> I'd use strcmp() to get exact matches. That way "allalksdjflas" doesn't
> match. :)
>
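
A minimal userspace sketch of the difference between the two
comparisons. The helper names are hypothetical and not part of the
patch; only strncmp() and strcmp() themselves are real:

```c
#include <stdbool.h>
#include <string.h>

/* Prefix match, as in v3: accepts any string that starts with "all". */
static bool matches_all_prefix(const char *buf)
{
	return strncmp(buf, "all", 3) == 0;
}

/* Exact match, as suggested in review: accepts only "all". */
static bool matches_all_exact(const char *buf)
{
	return strcmp(buf, "all") == 0;
}
```

A mistyped value such as "allalksdjflas" is accepted by the prefix
version and rejected by the exact one.
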
> > + static_branch_enable(&restrict_proc_mem[offset]);
> > + else if (strncmp(buf, "ptracer", 7) == 0)
> > + static_branch_enable(&restrict_proc_mem[offset + 1]);
> > +
> > + return 0;
> > +}
>
> Then don't bother with a common helper since you've got a macro, and
> it'll all get tossed after __init anyway.
>
> > +
> > +#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
> > +static int __init early_restrict_proc_mem_##name(char *buf) \
> > +{ \
> > + return early_restrict_proc_mem(buf, offset); \
> > +} \
> > +early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
> > +
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
> > +DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
>
> #define DEFINE_EARLY_PROC_MEM_RESTRICT(name) \
> static int __init early_proc_mem_restrict_##name(char *buf) \
> { \
> if (!buf) \
> return -EINVAL; \
> \
> if (strcmp(buf, "all") == 0) \
> static_branch_enable(&proc_mem_restrict_##name); \
> else if (strcmp(buf, "ptracer") == 0) \
> static_branch_enable(&proc_mem_restrict_##name); \
> \
> return 0; \
> } \
> early_param("proc_mem_restrict_" #name, early_proc_mem_restrict_##name)
>
>
> > +
> > /*
> > * Count the number of hardlinks for the pid_entry table, excluding the .
> > * and .. links.
> > @@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
> > return 0;
> > }
> >
> > +static bool __mem_open_current_is_ptracer(struct file *file)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct task_struct *task = get_proc_task(inode);
> > + int ret = false;
> > +
> > + if (task) {
> > + rcu_read_lock();
> > + if (current == ptrace_parent(task))
> > + ret = true;
> > + rcu_read_unlock();
> > + put_task_struct(task);
> > + }
>
> This creates a ToCToU race between this check (which releases the task)
> and the later memopen which may get a different task (and mm).

Special thanks for noticing that I introduced this in v3!
It was an accident, my mistake :)

I'll pay close attention to fixing this in v4 and will come back if I
have any questions.

>
> To deal with this, I think you need to add a new mode flag for
> proc_mem_open(), and add the checking there.
>
> > +
> > + return ret;
> > +}
> > +
> > +static int __mem_open_check_access_restriction(struct file *file)
> > +{
> > + if (file->f_mode & FMODE_WRITE) {
> > + /* Deny if writes are unconditionally disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[0]))
> > + return -EACCES;
> > +
> > + /* Deny if writes are allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[1]) &&
> > + !__mem_open_current_is_ptracer(file))
> > + return -EACCES;
> > +
> > + } else if (file->f_mode & FMODE_READ) {
>
> I think this "else" means that O_RDWR opens will only check the write
> flag, so drop the "else".
>
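To convince myself of the O_RDWR point, here is a minimal userspace sketch
of the two shapes of the check. Everything named model_* / MODEL_* is a
hypothetical stand-in and not kernel code; the two mode bits only mimic
FMODE_READ and FMODE_WRITE, and the policy struct stands in for the
"restrict all" keys.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FMODE_READ / FMODE_WRITE bits. */
#define MODEL_FMODE_READ  0x1u
#define MODEL_FMODE_WRITE 0x2u

/* Hypothetical policy: one flag per "restrict all" parameter. */
struct model_policy {
	bool restrict_read;
	bool restrict_write;
};

/* Shape of the posted check: the read test is skipped when the write bit is set. */
static int check_with_else(unsigned int f_mode, const struct model_policy *p)
{
	if (f_mode & MODEL_FMODE_WRITE) {
		if (p->restrict_write)
			return -EACCES;
	} else if (f_mode & MODEL_FMODE_READ) {
		if (p->restrict_read)
			return -EACCES;
	}
	return 0;
}

/* Shape after dropping the "else": both bits are tested independently. */
static int check_without_else(unsigned int f_mode, const struct model_policy *p)
{
	if ((f_mode & MODEL_FMODE_WRITE) && p->restrict_write)
		return -EACCES;
	if ((f_mode & MODEL_FMODE_READ) && p->restrict_read)
		return -EACCES;
	return 0;
}
```

With only reads restricted, a read+write open passes the first variant and
is denied by the second, which is the behaviour the review asks for.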
> > + /* Deny if reads are unconditionally disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[2]))
> > + return -EACCES;
> > +
> > + /* Deny if reads are allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[3]) &&
> > + !__mem_open_current_is_ptracer(file))
> > + return -EACCES;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static int mem_open(struct inode *inode, struct file *file)
> > {
> > - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > + int ret;
> > +
> > + ret = __mem_open_check_access_restriction(file);
> > + if (ret)
> > + return ret;
> > +
> > + ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >
> > /* OK to pass negative loff_t, we can catch out-of-range */
> > file->f_mode |= FMODE_UNSIGNED_OFFSET;
> > @@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
> > return ret;
> > }
> >
> > +static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
> > +{
> > + /* Deny if FOLL_FORCE is disabled via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[4]))
> > + return 0;
> > +
> > + /* Deny if FOLL_FORCE is allowed only for ptracers via param */
> > + if (static_branch_unlikely(&restrict_proc_mem[5]) &&
> > + !__mem_open_current_is_ptracer(file))
>
> This is like the ToCToU: the task may have changed out from under us
> between the open and the read/write.
>
> I'm not sure how to store this during "open" though... Hmmm
>
> > + return 0;
> > +
> > + return FOLL_FORCE;
> > +}
> > +
> > static ssize_t mem_rw(struct file *file, char __user *buf,
> > size_t count, loff_t *ppos, int write)
> > {
> > @@ -855,7 +953,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
> > if (!mmget_not_zero(mm))
> > goto free;
> >
> > - flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> > + flags = (write ? FOLL_WRITE : 0);
> > + flags |= __mem_rw_get_foll_force_flag(file);
>
> I wonder if we need some way to track openers in the mm? That sounds
> not-fun.
>
> >
> > while (count > 0) {
> > size_t this_len = min_t(size_t, count, PAGE_SIZE);
> > diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> > index f5a2727ca4a9a..ba2460fe878c5 100644
> > --- a/include/linux/jump_label.h
> > +++ b/include/linux/jump_label.h
> > @@ -398,6 +398,11 @@ struct static_key_false {
> > [0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
> > }
> >
> > +#define DEFINE_STATIC_KEY_ARRAY_FALSE_RO(name, count) \
> > + struct static_key_false name[count] __ro_after_init = { \
> > + [0 ... (count) - 1] = STATIC_KEY_FALSE_INIT, \
> > + }
>
> Let's not add this. :)
>
> > +
> > #define _DEFINE_STATIC_KEY_1(name) DEFINE_STATIC_KEY_TRUE(name)
> > #define _DEFINE_STATIC_KEY_0(name) DEFINE_STATIC_KEY_FALSE(name)
> > #define DEFINE_STATIC_KEY_MAYBE(cfg, name) \
>
> So, yes, conceptually, I really like this -- we've got some good
> granularity now, and wow do I love being able to turn off FOLL_FORCE. :)
>
> Safely checking for ptracer is tricky, though. I wonder how we could
> store the foll_force state in the private_data somehow. Seems a bit
> painful to allocate a struct for it. We could do some really horrid
> hacks like store it in the low bit of the mm address that gets stored to
> private_data and mask it out when used, but that's really ugly too..
>
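For reference, the low-bit idea could look roughly like the userspace sketch
below. This is only an illustration of pointer tagging under the assumption
that the stored pointer is at least 2-byte aligned; model_mm, pack_private()
and the unpack helpers are hypothetical names, not existing kernel
interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mm_struct; it only needs >= 2-byte alignment. */
struct model_mm {
	long dummy;
};

/* The lowest pointer bit is free on aligned objects, so it can carry one flag. */
#define MODEL_FOLL_FORCE_BIT ((uintptr_t)1)

/* Pack the pointer and the flag into one word, like file->private_data would hold. */
static void *pack_private(struct model_mm *mm, bool allow_foll_force)
{
	uintptr_t word = (uintptr_t)mm;

	if (allow_foll_force)
		word |= MODEL_FOLL_FORCE_BIT;
	return (void *)word;
}

/* Mask the flag out again before the value is used as a pointer. */
static struct model_mm *unpack_mm(void *private_data)
{
	return (struct model_mm *)((uintptr_t)private_data & ~MODEL_FOLL_FORCE_BIT);
}

/* Read back the flag that was decided at open time. */
static bool unpack_allow_foll_force(void *private_data)
{
	return ((uintptr_t)private_data & MODEL_FOLL_FORCE_BIT) != 0;
}
```

Every user of private_data would have to go through the unpack helper, which
is a large part of why this is ugly.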
> -Kees
>
> --
> Kees Cook
On Tuesday, May 14, 2024 02:50 EEST, Kees Cook <[email protected]> wrote:
> On Fri, May 03, 2024 at 11:57:56AM +0200, Christian Brauner wrote:
> > On Fri, Apr 26, 2024 at 04:10:49PM -0700, Kees Cook wrote:
> > > On Tue, Apr 09, 2024 at 08:57:49PM +0300, Adrian Ratiu wrote:
> > > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > > > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > > > writing to /proc/pid/mem"). Famous last words from that patch:
> > > > "no longer a security hazard". :)
> > > >
> > > > Afterwards exploits started causing drama like [1]. The exploits
> > > > using /proc/*/mem can be rather sophisticated like [2] which
> > > > installed an arbitrary payload from noexec storage into a running
> > > > process then exec'd it, which itself could include an ELF loader
> > > > to run arbitrary code off noexec storage.
> > > >
> > > > One of the well-known problems with /proc/*/mem writes is they
> > > > ignore page permissions via FOLL_FORCE, as opposed to writes via
> > > > process_vm_writev which respect page permissions. These writes can
> > > > also be used to bypass mode bits.
> > > >
> > > > > To harden against these types of attacks, distributions might want
> > > > > to restrict /proc/pid/mem accesses, either entirely or partially,
> > > > > e.g. to restrict FOLL_FORCE usage.
> > > >
> > > > Known valid use-cases which still need these accesses are:
> > > >
> > > > * Debuggers which also have ptrace permissions, so they can access
> > > > memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
> > > > are designed to write /proc/pid/mem for basic functionality.
> > > >
> > > > * Container supervisors using the seccomp notifier to intercept
> > > > syscalls and rewrite memory of calling processes by passing
> > > > around /proc/pid/mem file descriptors.
> > > >
> > > > There might be more, that's why these params default to disabled.
> > > >
> > > > Regarding other mechanisms which can block these accesses:
> > > >
> > > > * seccomp filters can be used to block mmap/mprotect calls with W|X
> > > > perms, but they often can't block open calls as daemons want to
> > > > read/write their runtime state and seccomp filters cannot check
> > > > file paths, so plain write calls can't be easily blocked.
> > > >
> > > > * Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > > > can't run chmod once at boot to restrict it (and trying to react
> > > > to every process and run chmod doesn't scale, and the kernel no
> > > > longer allows chmod on any of these paths).
> > > >
> > > > * SELinux could be used with a rule to cover all /proc/*/mem files,
> > > > but even then having multiple ways to deny an attack is useful in
> > > > case one layer fails.
> > > >
> > > > Thus we introduce three kernel parameters to restrict /proc/*/mem
> > > > access: read, write and foll_force. All three can be independently
> > > > set to the following values:
> > > >
> > > > all => restrict all access unconditionally.
> > > > ptracer => restrict all access except for ptracer processes.
> > > >
> > > > If left unset, the existing behaviour is preserved, i.e. access
> > > > is governed by basic file permissions.
> > > >
> > > > Examples which can be passed by bootloaders:
> > > >
> > > > restrict_proc_mem_foll_force=all
> > > > restrict_proc_mem_write=ptracer
> > > > restrict_proc_mem_read=ptracer
> > > >
> > > > Each distribution needs to decide what restrictions to apply,
> > > > depending on its use-cases. Embedded systems might want to do
> > > > > more, while general-purpose distros might want a more relaxed
> > > > > policy, because e.g. foll_force=all and write=all both break
> > > > > GDB, so they might be a bit excessive.
> > > >
> > > > Based on an initial patch by Mike Frysinger <[email protected]>.
> > >
> > > Thanks for this new version!
> > >
> > > >
> > > > Link: https://lwn.net/Articles/476947/ [1]
> > > > Link: https://issues.chromium.org/issues/40089045 [2]
> > > > Cc: Guenter Roeck <[email protected]>
> > > > Cc: Doug Anderson <[email protected]>
> > > > Cc: Kees Cook <[email protected]>
> > > > Cc: Jann Horn <[email protected]>
> > > > Cc: Andrew Morton <[email protected]>
> > > > Cc: Randy Dunlap <[email protected]>
> > > > Cc: Christian Brauner <[email protected]>
> > > > Co-developed-by: Mike Frysinger <[email protected]>
> > > > Signed-off-by: Mike Frysinger <[email protected]>
> > > > Signed-off-by: Adrian Ratiu <[email protected]>
> > > > ---
> > > > .../admin-guide/kernel-parameters.txt | 27 +++++
> > > > fs/proc/base.c | 103 +++++++++++++++++-
> > > > include/linux/jump_label.h | 5 +
> > > > 3 files changed, 133 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 6e62b8cb19c8d..d7f7db41369c7 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -5665,6 +5665,33 @@
> > > > reset_devices [KNL] Force drivers to reset the underlying device
> > > > during initialization.
> > > >
> > > > + restrict_proc_mem_read= [KNL]
> > > > + Format: {all | ptracer}
> > > > + Allows restricting read access to /proc/*/mem files.
> > > > > + Depending on restriction level, opens for reads return -EACCES.
> > > > + Can be one of:
> > > > + - 'all' restricts all access unconditionally.
> > > > + - 'ptracer' allows access only for ptracer processes.
> > > > + If not specified, then basic file permissions continue to apply.
> > > > +
> > > > + restrict_proc_mem_write= [KNL]
> > > > + Format: {all | ptracer}
> > > > + Allows restricting write access to /proc/*/mem files.
> > > > > + Depending on restriction level, opens for writes return -EACCES.
> > > > + Can be one of:
> > > > + - 'all' restricts all access unconditionally.
> > > > + - 'ptracer' allows access only for ptracer processes.
> > > > + If not specified, then basic file permissions continue to apply.
> > > > +
> > > > + restrict_proc_mem_foll_force= [KNL]
> > > > + Format: {all | ptracer}
> > > > + Restricts the use of the FOLL_FORCE flag for /proc/*/mem access.
> > > > + If restricted, the FOLL_FORCE flag will not be added to vm accesses.
> > > > + Can be one of:
> > > > + - 'all' restricts all access unconditionally.
> > > > + - 'ptracer' allows access only for ptracer processes.
> > > > + If not specified, FOLL_FORCE is always used.
> > >
> > > bike shedding: I wonder if this should be a fake namespace (adding a dot
> > > just to break it up for reading more easily), and have words reordered
> > > to the kernel's more common subject-verb-object: proc_mem.restrict_read=...
> > >
> > > > +
> > > > resume= [SWSUSP]
> > > > Specify the partition device for software suspend
> > > > Format:
> > > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > > index 18550c071d71c..c733836c42a65 100644
> > > > --- a/fs/proc/base.c
> > > > +++ b/fs/proc/base.c
> > > > @@ -152,6 +152,41 @@ struct pid_entry {
> > > > NULL, &proc_pid_attr_operations, \
> > > > { .lsmid = LSMID })
> > > >
> > > > +/*
> > > > + * each restrict_proc_mem_* param controls the following static branches:
> > > > + * key[0] = restrict all writes
> > > > + * key[1] = restrict writes except for ptracers
> > > > + * key[2] = restrict all reads
> > > > + * key[3] = restrict reads except for ptracers
> > > > + * key[4] = restrict all FOLL_FORCE usage
> > > > + * key[5] = restrict FOLL_FORCE usage except for ptracers
> > > > + */
> > > > +DEFINE_STATIC_KEY_ARRAY_FALSE_RO(restrict_proc_mem, 6);
> > >
> > > So, I don't like having open-coded numbers. And I'm not sure there's a
> > > benefit to stuffing these all into an array? So:
> > >
> > > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_read);
> > > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_write);
> > > DEFINE_STATIC_KEY_FALSE_RO(proc_mem_restrict_foll_force);
> > >
> > > > +
> > > > +static int __init early_restrict_proc_mem(char *buf, int offset)
> > > > +{
> > > > + if (!buf)
> > > > + return -EINVAL;
> > > > +
> > > > + if (strncmp(buf, "all", 3) == 0)
> > >
> > > I'd use strcmp() to get exact matches. That way "allalksdjflas" doesn't
> > > match. :)
> > >
> > > > + static_branch_enable(&restrict_proc_mem[offset]);
> > > > + else if (strncmp(buf, "ptracer", 7) == 0)
> > > > + static_branch_enable(&restrict_proc_mem[offset + 1]);
> > > > +
> > > > + return 0;
> > > > +}
> > >
> > > Then don't bother with a common helper since you've got a macro, and
> > > it'll all get tossed after __init anyway.
> > >
> > > > +
> > > > +#define DEFINE_EARLY_RESTRICT_PROC_MEM(name, offset) \
> > > > +static int __init early_restrict_proc_mem_##name(char *buf) \
> > > > +{ \
> > > > + return early_restrict_proc_mem(buf, offset); \
> > > > +} \
> > > > +early_param("restrict_proc_mem_" #name, early_restrict_proc_mem_##name)
> > > > +
> > > > +DEFINE_EARLY_RESTRICT_PROC_MEM(write, 0);
> > > > +DEFINE_EARLY_RESTRICT_PROC_MEM(read, 2);
> > > > +DEFINE_EARLY_RESTRICT_PROC_MEM(foll_force, 4);
> > >
> > > #define DEFINE_EARLY_PROC_MEM_RESTRICT(name) \
> > > static int __init early_proc_mem_restrict_##name(char *buf) \
> > > { \
> > > if (!buf) \
> > > return -EINVAL; \
> > > \
> > > if (strcmp(buf, "all") == 0) \
> > > static_branch_enable(&proc_mem_restrict_##name); \
> > > else if (strcmp(buf, "ptracer") == 0) \
> > > static_branch_enable(&proc_mem_restrict_##name); \
> > > \
> > > return 0; \
> > > } \
> > > early_param("proc_mem_restrict_" #name, early_proc_mem_restrict_##name)
> > >
> > >
> > > > +
> > > > /*
> > > > * Count the number of hardlinks for the pid_entry table, excluding the .
> > > > * and .. links.
> > > > @@ -825,9 +860,58 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
> > > > return 0;
> > > > }
> > > >
> > > > +static bool __mem_open_current_is_ptracer(struct file *file)
> > > > +{
> > > > + struct inode *inode = file_inode(file);
> > > > + struct task_struct *task = get_proc_task(inode);
> > > > + int ret = false;
> > > > +
> > > > + if (task) {
> > > > + rcu_read_lock();
> > > > + if (current == ptrace_parent(task))
> > > > + ret = true;
> > > > + rcu_read_unlock();
> > > > + put_task_struct(task);
> > > > + }
> > >
> > > This creates a ToCToU race between this check (which releases the task)
> > > > and the later memopen which may get a different task (and mm).
> > >
> > > To deal with this, I think you need to add a new mode flag for
> > > proc_mem_open(), and add the checking there.
> > >
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int __mem_open_check_access_restriction(struct file *file)
> > > > +{
> > > > + if (file->f_mode & FMODE_WRITE) {
> > > > + /* Deny if writes are unconditionally disabled via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[0]))
> > > > + return -EACCES;
> > > > +
> > > > + /* Deny if writes are allowed only for ptracers via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[1]) &&
> > > > + !__mem_open_current_is_ptracer(file))
> > > > + return -EACCES;
> > > > +
> > > > + } else if (file->f_mode & FMODE_READ) {
> > >
> > > I think this "else" means that O_RDWR opens will only check the write
> > > flag, so drop the "else".
> > >
> > > > + /* Deny if reads are unconditionally disabled via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[2]))
> > > > + return -EACCES;
> > > > +
> > > > + /* Deny if reads are allowed only for ptracers via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[3]) &&
> > > > + !__mem_open_current_is_ptracer(file))
> > > > + return -EACCES;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int mem_open(struct inode *inode, struct file *file)
> > > > {
> > > > - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > > > + int ret;
> > > > +
> > > > + ret = __mem_open_check_access_restriction(file);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > > >
> > > > /* OK to pass negative loff_t, we can catch out-of-range */
> > > > file->f_mode |= FMODE_UNSIGNED_OFFSET;
> > > > @@ -835,6 +919,20 @@ static int mem_open(struct inode *inode, struct file *file)
> > > > return ret;
> > > > }
> > > >
> > > > +static unsigned int __mem_rw_get_foll_force_flag(struct file *file)
> > > > +{
> > > > + /* Deny if FOLL_FORCE is disabled via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[4]))
> > > > + return 0;
> > > > +
> > > > + /* Deny if FOLL_FORCE is allowed only for ptracers via param */
> > > > + if (static_branch_unlikely(&restrict_proc_mem[5]) &&
> > > > + !__mem_open_current_is_ptracer(file))
> > >
> > > This is like the ToCToU: the task may have changed out from under us
> > > > between the open and the read/write.
> >
> > But why would you care? As long as the task is the ptracer it doesn't
> > really matter afaict.
>
> Because the mm you're writing to may no longer be associated with the
> task.
>
> proc_mem_operations.open() will take a reference to the current task's
> mm, via proc_mem_open() through __mem_open():
>
> struct task_struct *task = get_proc_task(inode);
> ...
> mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
> ...
> file->private_data = mm;
>
>
> And in the proposed check added to mem_rw(), if get_proc_task(inode)
> returns a different task (i.e. the pid got recycled and the original mm
> is still associated with a forked task), then it could write to the
> forked task using the ptrace check against the new task.
>
> Looking at it again now, I think it should be possible to just revalidate
> the mm in __mem_open_current_is_ptracer(), though. i.e. it would be
> allowed if ptrace check passes and file->private_data == mm_access(...),
> for the mem_rw case...
Ack, I'll do this in v4, thanks again!
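
To check my understanding of the revalidation idea before v4, here is a
simplified userspace model of it. All model_* names are hypothetical; the
"slot" pointer stands in for the pid behind the inode, and comparing
t->mm against the stored mm stands in for comparing file->private_data
with the result of mm_access(). It is a sketch of the logic only, not the
planned kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address space. */
struct model_mm {
	int id;
};

/* Hypothetical stand-in for a task: its mm and who (if anyone) traces it. */
struct model_task {
	struct model_mm *mm;
	const struct model_task *ptracer;
};

/* Hypothetical stand-in for the open file. */
struct model_file {
	struct model_task **slot; /* whatever task currently holds the pid */
	struct model_mm *mm;      /* captured at open, like file->private_data */
};

/* Open: remember the mm of the task currently behind the pid slot. */
static void model_open(struct model_file *f, struct model_task **slot)
{
	f->slot = slot;
	f->mm = (*slot)->mm;
}

/* Ptracer check alone: passes if the pid was recycled to a task the caller traces. */
static bool ptracer_only(const struct model_file *f, const struct model_task *caller)
{
	const struct model_task *t = *f->slot;

	return t && t->ptracer == caller;
}

/* Ptracer check plus revalidation that the task still has the mm stored at open. */
static bool ptracer_and_same_mm(const struct model_file *f, const struct model_task *caller)
{
	const struct model_task *t = *f->slot;

	return t && t->ptracer == caller && t->mm == f->mm;
}
```

If the slot is switched to a different task that the caller does trace, the
first check passes even though the stored mm belongs to the old task; the
second check rejects that case and still accepts a traced task whose mm
matches.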