This series removes all in-tree usage of MAP_DENYWRITE from the kernel
and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
user space applications a while ago because of the chance for DoS.
The last renaming user is binfmt binary loading during exec and
legacy library loading via uselib().
With this change, MAP_DENYWRITE is effectively ignored throughout the
kernel. Although the net change is small (well, we actually add code and
comments), I think the cleanup in mmap() is quite nice.
There are some (minor) user-visible changes with this series:
1. We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g., via dlopen().
2. We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often is).
3. We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file
cannot be denied, and write access to the file will remain denied
until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
There is a related problem [2] with overlayfs, that should at least partly
be tackled by this series. I don't quite understand the interaction of
overlayfs and deny_write_access()/allow_write_access() at exec time:
If we end up denying write access to the wrong file and not to the
realfile, that would be fundamentally broken. We would have to reroute
our deny_write_access()/ allow_write_access() calls for the exec file to
the realfile -- but I leave figuring out the details to overlayfs guys, as
that would be a related but different issue.
There was a lengthy discussion in [3] whether to remove deny_write_access()
completely; however, if we decide to go that way, it would ideally be done
on top, because it could be that some applications even rely on the current
behavior.
v1 -> v2:
- "kernel/fork: factor out replacing the current MM exe_file"
-- Call the function "replace_mm_exe_file()" instead
-- Add some doc, similar to set_mm_exe_file()
-- Update patch subject/description
- "kernel/fork: always deny write access to current MM exe_file"
-- Introduce dup_mm_exe_file()
-- Make set_mm_exe_file() return an error to make the code easier to
grasp.
-- Improve comments
- Added ACKs
- Mention "sys_prctl(PR_SET_MM_MAP/EXE_FILE)" everywhere instead of
only "sys_prctl(PR_SET_MM_EXE_FILE)".
RFC -> v1:
- "binfmt: remove in-tree usage of MAP_DENYWRITE"
-- Add a note that this should fix part of a problem with overlayfs
[1] https://lore.kernel.org/r/[email protected]/
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lkml.kernel.org/r/[email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Shawn Anastasio <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Gabriel Krisman Bertazi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Nicolas Viennot <[email protected]>
Cc: Thomas Cedeno <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Chengguang Xu <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Laight <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
David Hildenbrand (7):
binfmt: don't use MAP_DENYWRITE when loading shared libraries via
uselib()
kernel/fork: factor out replacing the current MM exe_file
kernel/fork: always deny write access to current MM exe_file
binfmt: remove in-tree usage of MAP_DENYWRITE
mm: remove VM_DENYWRITE
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
fs: update documentation of get_write_access() and friends
arch/x86/ia32/ia32_aout.c | 8 ++-
fs/binfmt_aout.c | 7 ++-
fs/binfmt_elf.c | 6 +--
fs/binfmt_elf_fdpic.c | 2 +-
fs/exec.c | 4 +-
fs/proc/task_mmu.c | 1 -
include/linux/fs.h | 19 ++++---
include/linux/mm.h | 4 +-
include/linux/mman.h | 4 +-
include/trace/events/mmflags.h | 1 -
kernel/events/core.c | 2 -
kernel/fork.c | 95 ++++++++++++++++++++++++++++++----
kernel/sys.c | 33 +-----------
lib/test_printf.c | 5 +-
mm/mmap.c | 29 ++---------
mm/nommu.c | 2 -
16 files changed, 119 insertions(+), 103 deletions(-)
base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
--
2.31.1
uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().
For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.
With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.
This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/x86/ia32/ia32_aout.c | 2 +-
fs/binfmt_aout.c | 2 +-
fs/binfmt_elf.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 5e5b9fc2747f..321d7b22ad2d 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -293,7 +293,7 @@ static int load_aout_library(struct file *file)
/* Now use mmap to map the library into memory. */
error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_32BIT,
+ MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
N_TXTOFF(ex));
retval = error;
if (error != start_addr)
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 145917f734fe..d29de971d3f3 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -309,7 +309,7 @@ static int load_aout_library(struct file *file)
/* Now use mmap to map the library into memory. */
error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+ MAP_FIXED | MAP_PRIVATE;
N_TXTOFF(ex));
retval = error;
if (error != start_addr)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 439ed81e755a..6d2c79533631 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1384,7 +1384,7 @@ static int load_elf_library(struct file *file)
(eppnt->p_filesz +
ELF_PAGEOFFSET(eppnt->p_vaddr)),
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED_NOREPLACE | MAP_PRIVATE | MAP_DENYWRITE,
+ MAP_FIXED_NOREPLACE | MAP_PRIVATE,
(eppnt->p_offset -
ELF_PAGEOFFSET(eppnt->p_vaddr)));
if (error != ELF_PAGESTART(eppnt->p_vaddr))
--
2.31.1
Let's factor the main logic out into replace_mm_exe_file(), such that
all mm->exe_file logic is contained in kernel/fork.c.
While at it, perform some simple cleanups that are possible now that
we're simplifying the individual functions.
Acked-by: Christian Brauner <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 1 +
kernel/fork.c | 44 +++++++++++++++++++++++++++++++++++++++++---
kernel/sys.c | 33 +--------------------------------
3 files changed, 43 insertions(+), 35 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..48c6fa9ab792 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2581,6 +2581,7 @@ extern int mm_take_all_locks(struct mm_struct *mm);
extern void mm_drop_all_locks(struct mm_struct *mm);
extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
+extern int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern struct file *get_task_exe_file(struct task_struct *task);
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..eedce5c77041 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1148,9 +1148,7 @@ void mmput_async(struct mm_struct *mm)
*
* Main users are mmput() and sys_execve(). Callers prevent concurrent
* invocations: in mmput() nobody alive left, in execve task is single
- * threaded. sys_prctl(PR_SET_MM_MAP/EXE_FILE) also needs to set the
- * mm->exe_file, but does so without using set_mm_exe_file() in order
- * to avoid the need for any locks.
+ * threaded.
*/
void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
{
@@ -1170,6 +1168,46 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
fput(old_exe_file);
}
+/**
+ * replace_mm_exe_file - replace a reference to the mm's executable file
+ *
+ * This changes mm's executable file (shown as symlink /proc/[pid]/exe),
+ * dealing with concurrent invocation and without grabbing the mmap lock in
+ * write mode.
+ *
+ * Main user is sys_prctl(PR_SET_MM_MAP/EXE_FILE).
+ */
+int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
+{
+ struct vm_area_struct *vma;
+ struct file *old_exe_file;
+ int ret = 0;
+
+ /* Forbid mm->exe_file change if old file still mapped. */
+ old_exe_file = get_mm_exe_file(mm);
+ if (old_exe_file) {
+ mmap_read_lock(mm);
+ for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
+ if (!vma->vm_file)
+ continue;
+ if (path_equal(&vma->vm_file->f_path,
+ &old_exe_file->f_path))
+ ret = -EBUSY;
+ }
+ mmap_read_unlock(mm);
+ fput(old_exe_file);
+ if (ret)
+ return ret;
+ }
+
+ /* set the new file, lockless */
+ get_file(new_exe_file);
+ old_exe_file = xchg(&mm->exe_file, new_exe_file);
+ if (old_exe_file)
+ fput(old_exe_file);
+ return 0;
+}
+
/**
* get_mm_exe_file - acquire a reference to the mm's executable file
*
diff --git a/kernel/sys.c b/kernel/sys.c
index ef1a78f5d71c..30c12e54585a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1846,7 +1846,6 @@ SYSCALL_DEFINE1(umask, int, mask)
static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
{
struct fd exe;
- struct file *old_exe, *exe_file;
struct inode *inode;
int err;
@@ -1869,40 +1868,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
if (err)
goto exit;
- /*
- * Forbid mm->exe_file change if old file still mapped.
- */
- exe_file = get_mm_exe_file(mm);
- err = -EBUSY;
- if (exe_file) {
- struct vm_area_struct *vma;
-
- mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma->vm_file)
- continue;
- if (path_equal(&vma->vm_file->f_path,
- &exe_file->f_path))
- goto exit_err;
- }
-
- mmap_read_unlock(mm);
- fput(exe_file);
- }
-
- err = 0;
- /* set the new file, lockless */
- get_file(exe.file);
- old_exe = xchg(&mm->exe_file, exe.file);
- if (old_exe)
- fput(old_exe);
+ err = replace_mm_exe_file(mm, exe.file);
exit:
fdput(exe);
return err;
-exit_err:
- mmap_read_unlock(mm);
- fput(exe_file);
- goto exit;
}
/*
--
2.31.1
We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().
Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.
Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.
This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/exec.c | 4 +++-
include/linux/mm.h | 2 +-
kernel/fork.c | 50 ++++++++++++++++++++++++++++++++++++++++------
3 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..9294049f5487 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1270,7 +1270,9 @@ int begin_new_exec(struct linux_binprm * bprm)
* not visibile until then. This also enables the update
* to be lockless.
*/
- set_mm_exe_file(bprm->mm, bprm->file);
+ retval = set_mm_exe_file(bprm->mm, bprm->file);
+ if (retval)
+ goto out;
/* If the binary is not readable then enforce mm->dumpable=0 */
would_dump(bprm, bprm->file);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 48c6fa9ab792..56b1cd41db61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2580,7 +2580,7 @@ static inline int check_data_rlimit(unsigned long rlim,
extern int mm_take_all_locks(struct mm_struct *mm);
extern void mm_drop_all_locks(struct mm_struct *mm);
-extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
+extern int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
extern int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern struct file *get_task_exe_file(struct task_struct *task);
diff --git a/kernel/fork.c b/kernel/fork.c
index eedce5c77041..543541764865 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -470,6 +470,20 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);
+static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ struct file *exe_file;
+
+ exe_file = get_mm_exe_file(oldmm);
+ RCU_INIT_POINTER(mm->exe_file, exe_file);
+ /*
+ * We depend on the oldmm having properly denied write access to the
+ * exe_file already.
+ */
+ if (exe_file && deny_write_access(exe_file))
+ pr_warn_once("deny_write_access() failed in %s\n", __func__);
+}
+
#ifdef CONFIG_MMU
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
@@ -493,7 +507,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
/* No ordering required: file already has been exposed. */
- RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
+ dup_mm_exe_file(mm, oldmm);
mm->total_vm = oldmm->total_vm;
mm->data_vm = oldmm->data_vm;
@@ -639,7 +653,7 @@ static inline void mm_free_pgd(struct mm_struct *mm)
static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
mmap_write_lock(oldmm);
- RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
+ dup_mm_exe_file(mm, oldmm);
mmap_write_unlock(oldmm);
return 0;
}
@@ -1149,8 +1163,10 @@ void mmput_async(struct mm_struct *mm)
* Main users are mmput() and sys_execve(). Callers prevent concurrent
* invocations: in mmput() nobody alive left, in execve task is single
* threaded.
+ *
+ * Can only fail if new_exe_file != NULL.
*/
-void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
+int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
{
struct file *old_exe_file;
@@ -1161,11 +1177,21 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
*/
old_exe_file = rcu_dereference_raw(mm->exe_file);
- if (new_exe_file)
+ if (new_exe_file) {
+ /*
+ * We expect the caller (i.e., sys_execve) to already denied
+ * write access, so this is unlikely to fail.
+ */
+ if (unlikely(deny_write_access(new_exe_file)))
+ return -EACCES;
get_file(new_exe_file);
+ }
rcu_assign_pointer(mm->exe_file, new_exe_file);
- if (old_exe_file)
+ if (old_exe_file) {
+ allow_write_access(old_exe_file);
fput(old_exe_file);
+ }
+ return 0;
}
/**
@@ -1201,10 +1227,22 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
}
/* set the new file, lockless */
+ ret = deny_write_access(new_exe_file);
+ if (ret)
+ return -EACCES;
get_file(new_exe_file);
+
old_exe_file = xchg(&mm->exe_file, new_exe_file);
- if (old_exe_file)
+ if (old_exe_file) {
+ /*
+ * Don't race with dup_mmap() getting the file and disallowing
+ * write access while someone might open the file writable.
+ */
+ mmap_read_lock(mm);
+ allow_write_access(old_exe_file);
fput(old_exe_file);
+ mmap_read_unlock(mm);
+ }
return 0;
}
--
2.31.1
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/task_mmu.c | 1 -
include/linux/mm.h | 1 -
include/linux/mman.h | 1 -
include/trace/events/mmflags.h | 1 -
kernel/events/core.c | 2 --
kernel/fork.c | 3 ---
lib/test_printf.c | 5 ++---
mm/mmap.c | 27 +++------------------------
8 files changed, 5 insertions(+), 36 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..cf25be3e0321 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -619,7 +619,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_MAYSHARE)] = "ms",
[ilog2(VM_GROWSDOWN)] = "gd",
[ilog2(VM_PFNMAP)] = "pf",
- [ilog2(VM_DENYWRITE)] = "dw",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 56b1cd41db61..257995f62e83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,7 +281,6 @@ extern unsigned int kobjsize(const void *objp);
#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
#define VM_UFFD_MISSING 0x00000200 /* missing pages tracking */
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
-#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
#define VM_LOCKED 0x00002000
diff --git a/include/linux/mman.h b/include/linux/mman.h
index ebb09a964272..bd9aadda047b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -153,7 +153,6 @@ static inline unsigned long
calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
- _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
arch_calc_vm_flag_bits(flags);
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 390270e00a1d..f44c3fb8da1a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,7 +163,6 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
{VM_UFFD_MISSING, "uffd_missing" }, \
IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
{VM_PFNMAP, "pfnmap" }, \
- {VM_DENYWRITE, "denywrite" }, \
{VM_UFFD_WP, "uffd_wp" }, \
{VM_LOCKED, "locked" }, \
{VM_IO, "io" }, \
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e..19767bb9933c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8307,8 +8307,6 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
else
flags = MAP_PRIVATE;
- if (vma->vm_flags & VM_DENYWRITE)
- flags |= MAP_DENYWRITE;
if (vma->vm_flags & VM_LOCKED)
flags |= MAP_LOCKED;
if (is_vm_hugetlb_page(vma))
diff --git a/kernel/fork.c b/kernel/fork.c
index 543541764865..34d21642c44f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -570,12 +570,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
file = tmp->vm_file;
if (file) {
- struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
get_file(file);
- if (tmp->vm_flags & VM_DENYWRITE)
- put_write_access(inode);
i_mmap_lock_write(mapping);
if (tmp->vm_flags & VM_SHARED)
mapping_allow_writable(mapping);
diff --git a/lib/test_printf.c b/lib/test_printf.c
index 8ac71aee46af..8a48b61c3763 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -675,9 +675,8 @@ flags(void)
"uptodate|dirty|lru|active|swapbacked",
cmp_buffer);
- flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC
- | VM_DENYWRITE;
- test("read|exec|mayread|maywrite|mayexec|denywrite", "%pGv", &flags);
+ flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+ test("read|exec|mayread|maywrite|mayexec", "%pGv", &flags);
gfp = GFP_TRANSHUGE;
test("GFP_TRANSHUGE", "%pGg", &gfp);
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..589dc1dc13db 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -148,8 +148,6 @@ void vma_set_page_prot(struct vm_area_struct *vma)
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
{
- if (vma->vm_flags & VM_DENYWRITE)
- allow_write_access(file);
if (vma->vm_flags & VM_SHARED)
mapping_unmap_writable(mapping);
@@ -666,8 +664,6 @@ static void __vma_link_file(struct vm_area_struct *vma)
if (file) {
struct address_space *mapping = file->f_mapping;
- if (vma->vm_flags & VM_DENYWRITE)
- put_write_access(file_inode(file));
if (vma->vm_flags & VM_SHARED)
mapping_allow_writable(mapping);
@@ -1788,22 +1784,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_pgoff = pgoff;
if (file) {
- if (vm_flags & VM_DENYWRITE) {
- error = deny_write_access(file);
- if (error)
- goto free_vma;
- }
if (vm_flags & VM_SHARED) {
error = mapping_map_writable(file->f_mapping);
if (error)
- goto allow_write_and_free_vma;
+ goto free_vma;
}
- /* ->mmap() can change vma->vm_file, but must guarantee that
- * vma_link() below can deny write-access if VM_DENYWRITE is set
- * and map writably if VM_SHARED is set. This usually means the
- * new file must not have been exposed to user-space, yet.
- */
vma->vm_file = get_file(file);
error = call_mmap(file, vma);
if (error)
@@ -1860,13 +1846,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_link(mm, vma, prev, rb_link, rb_parent);
/* Once vma denies write, undo our temporary denial count */
- if (file) {
unmap_writable:
- if (vm_flags & VM_SHARED)
- mapping_unmap_writable(file->f_mapping);
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
- }
+ if (file && vm_flags & VM_SHARED)
+ mapping_unmap_writable(file->f_mapping);
file = vma->vm_file;
out:
perf_event_mmap(vma);
@@ -1906,9 +1888,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
-allow_write_and_free_vma:
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
free_vma:
vm_area_free(vma);
unacct_error:
--
2.31.1
As VM_DENYWRITE does no longer exists, let's spring-clean the
documentation of get_write_access() and friends.
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/fs.h | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..e0dc3e96ed72 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3055,15 +3055,20 @@ static inline void file_end_write(struct file *file)
}
/*
+ * This is used for regular files where some users -- especially the
+ * currently executed binary in a process, previously handled via
+ * VM_DENYWRITE -- cannot handle concurrent write (and maybe mmap
+ * read-write shared) accesses.
+ *
* get_write_access() gets write permission for a file.
* put_write_access() releases this write permission.
- * This is used for regular files.
- * We cannot support write (and maybe mmap read-write shared) accesses and
- * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
- * can have the following values:
- * 0: no writers, no VM_DENYWRITE mappings
- * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
- * > 0: (i_writecount) users are writing to the file.
+ * deny_write_access() denies write access to a file.
+ * allow_write_access() re-enables write access to a file.
+ *
+ * The i_writecount field of an inode can have the following values:
+ * 0: no write access, no denied write access
+ * < 0: (-i_writecount) users that denied write access to the file.
+ * > 0: (i_writecount) users that have write access to the file.
*
* Normally we operate on that counter with atomic_{inc,dec} and it's safe
* except for the cases where we don't hold i_writecount yet. Then we need to
--
2.31.1
At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).
Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.
In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.
While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
"Overlayfs did not honor positive i_writecount on realfile for
VM_DENYWRITE mappings." [1]
[1] https://lore.kernel.org/r/[email protected]/
Reported-by: Chengguang Xu <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/x86/ia32/ia32_aout.c | 6 ++----
fs/binfmt_aout.c | 5 ++---
fs/binfmt_elf.c | 4 ++--
fs/binfmt_elf_fdpic.c | 2 +-
4 files changed, 7 insertions(+), 10 deletions(-)
diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 321d7b22ad2d..9bd15241fadb 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -202,8 +202,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
PROT_READ | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE |
- MAP_32BIT,
+ MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
fd_offset);
if (error != N_TXTADDR(ex))
@@ -211,8 +210,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE |
- MAP_32BIT,
+ MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
fd_offset + ex.a_text);
if (error != N_DATADDR(ex))
return error;
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index d29de971d3f3..a47496d0f123 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -221,8 +221,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
}
error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
- PROT_READ | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+ PROT_READ | PROT_EXEC, MAP_FIXED | MAP_PRIVATE,
fd_offset);
if (error != N_TXTADDR(ex))
@@ -230,7 +229,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+ MAP_FIXED | MAP_PRIVATE,
fd_offset + ex.a_text);
if (error != N_DATADDR(ex))
return error;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6d2c79533631..69d900a8473d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -622,7 +622,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
if (eppnt->p_type == PT_LOAD) {
- int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
+ int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
unsigned long vaddr = 0;
@@ -1070,7 +1070,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
!!interpreter, false);
- elf_flags = MAP_PRIVATE | MAP_DENYWRITE;
+ elf_flags = MAP_PRIVATE;
vaddr = elf_ppnt->p_vaddr;
/*
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index cf4028487dcc..6d8fd6030cbb 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1041,7 +1041,7 @@ static int elf_fdpic_map_file_by_direct_mmap(struct elf_fdpic_params *params,
if (phdr->p_flags & PF_W) prot |= PROT_WRITE;
if (phdr->p_flags & PF_X) prot |= PROT_EXEC;
- flags = MAP_PRIVATE | MAP_DENYWRITE;
+ flags = MAP_PRIVATE;
maddr = 0;
switch (params->flags & ELF_FDPIC_FLAG_ARRANGEMENT) {
--
2.31.1
Let's also remove masking off MAP_DENYWRITE from ksys_mmap_pgoff():
the last in-tree occurrence of MAP_DENYWRITE is now in LEGACY_MAP_MASK,
which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag
is ignored throughout the kernel now.
Add a comment to LEGACY_MAP_MASK stating that MAP_DENYWRITE is ignored.
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mman.h | 3 ++-
mm/mmap.c | 2 --
mm/nommu.c | 2 --
3 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index bd9aadda047b..b66e91b8176c 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -32,7 +32,8 @@
* The historical set of flags that all mmap implementations implicitly
* support when a ->mmap_validate() op is not provided in file_operations.
*
- * MAP_EXECUTABLE is completely ignored throughout the kernel.
+ * MAP_EXECUTABLE and MAP_DENYWRITE are completely ignored throughout the
+ * kernel.
*/
#define LEGACY_MAP_MASK (MAP_SHARED \
| MAP_PRIVATE \
diff --git a/mm/mmap.c b/mm/mmap.c
index 589dc1dc13db..bf11fc6e8311 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1626,8 +1626,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
return PTR_ERR(file);
}
- flags &= ~MAP_DENYWRITE;
-
retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
if (file)
diff --git a/mm/nommu.c b/mm/nommu.c
index 3a93d4054810..0987d131bdfc 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1296,8 +1296,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
goto out;
}
- flags &= ~MAP_DENYWRITE;
-
retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
if (file)
--
2.31.1
Am 16.08.21 um 21:48 schrieb David Hildenbrand:
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last renaming user is binfmt binary loading during exec and
> legacy library loading via uselib().
>
> With this change, MAP_DENYWRITE is effectively ignored throughout the
> kernel. Although the net change is small (well, we actually add code and
> comments), I think the cleanup in mmap() is quite nice.
>
> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libaries loaded via legacy
> uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
> completed, treating it just like shared libraries (which it often is).
> 3. We always deny write access to the file linked via /proc/pid/exe:
> sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file
> cannot be denied, and write access to the file will remain denied
> until the link is effectivel gone (exec, termination,
> sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
>
> There is a related problem [2] with overlayfs, that should at least partly
> be tackled by this series. I don't quite understand the interaction of
> overlayfs and deny_write_access()/allow_write_access() at exec time:
>
> If we end up denying write access to the wrong file and not to the
> realfile, that would be fundamentally broken. We would have to reroute
> our deny_write_access()/ allow_write_access() calls for the exec file to
> the realfile -- but I leave figuring out the details to overlayfs guys, as
> that would be a related but different issue.
>
> There was a lengthy discussion in [3] whether to remove deny_write_access()
> completely; however, if we decide to go that way, it would ideally be done
> on top, because it could be that some applications even rely on the current
> behavior.
>
> v1 -> v2:
> - "kernel/fork: factor out replacing the current MM exe_file"
> -- Call the function "replace_mm_exe_file()" instead
> -- Add some doc, similar to set_mm_exe_file()
> -- Update patch subject/description
> - "kernel/fork: always deny write access to current MM exe_file"
> -- Introduce dup_mm_exe_file()
> -- Make set_mm_exe_file() return an error to make the code easier to
> grasp.
> -- Improve comments
> - Added ACKs
> - Mention "sys_prctl(PR_SET_MM_MAP/EXE_FILE)" everywhere instead of
> only "sys_prctl(PR_SET_MM_EXE_FILE)".
>
> RFC -> v1:
> - "binfmt: remove in-tree usage of MAP_DENYWRITE"
> -- Add a note that this should fix part of a problem with overlayfs
This is unfortunately way beyond my understanding of the fs layer to
actually review this.
But from the ten mile high view it looks like a really nice cleanup to
me and makes my live in the DRM subsystem much easier.
Feel free to add a Acked-by: Christian König <[email protected]>
to the series.
Thanks,
Christian.
>
> [1] https://lore.kernel.org/r/[email protected]/
> [2] https://lore.kernel.org/r/[email protected]/
> [3] https://lkml.kernel.org/r/[email protected]
>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Sergey Senozhatsky <[email protected]>
> Cc: Andy Shevchenko <[email protected]>
> Cc: Rasmus Villemoes <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Greg Ungerer <[email protected]>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Vincenzo Frascino <[email protected]>
> Cc: Chinwen Chang <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: Feng Tang <[email protected]>
> Cc: Kevin Brodsky <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Shawn Anastasio <[email protected]>
> Cc: Steven Price <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Gabriel Krisman Bertazi <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Marco Elver <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Nicolas Viennot <[email protected]>
> Cc: Thomas Cedeno <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Chengguang Xu <[email protected]>
> Cc: "Christian König" <[email protected]>
> Cc: Florian Weimer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: David Laight <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> David Hildenbrand (7):
> binfmt: don't use MAP_DENYWRITE when loading shared libraries via
> uselib()
> kernel/fork: factor out replacing the current MM exe_file
> kernel/fork: always deny write access to current MM exe_file
> binfmt: remove in-tree usage of MAP_DENYWRITE
> mm: remove VM_DENYWRITE
> mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
> fs: update documentation of get_write_access() and friends
>
> arch/x86/ia32/ia32_aout.c | 8 ++-
> fs/binfmt_aout.c | 7 ++-
> fs/binfmt_elf.c | 6 +--
> fs/binfmt_elf_fdpic.c | 2 +-
> fs/exec.c | 4 +-
> fs/proc/task_mmu.c | 1 -
> include/linux/fs.h | 19 ++++---
> include/linux/mm.h | 4 +-
> include/linux/mman.h | 4 +-
> include/trace/events/mmflags.h | 1 -
> kernel/events/core.c | 2 -
> kernel/fork.c | 95 ++++++++++++++++++++++++++++++----
> kernel/sys.c | 33 +-----------
> lib/test_printf.c | 5 +-
> mm/mmap.c | 29 ++---------
> mm/nommu.c | 2 -
> 16 files changed, 119 insertions(+), 103 deletions(-)
>
>
> base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
So I like this series.
However, logically, I think this part in replace_mm_exe_file() no
longer makes sense:
On Mon, Aug 16, 2021 at 12:50 PM David Hildenbrand <[email protected]> wrote:
>
> + /* Forbid mm->exe_file change if old file still mapped. */
> + old_exe_file = get_mm_exe_file(mm);
> + if (old_exe_file) {
> + mmap_read_lock(mm);
> + for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
> + if (!vma->vm_file)
> + continue;
> + if (path_equal(&vma->vm_file->f_path,
> + &old_exe_file->f_path))
> + ret = -EBUSY;
> + }
> + mmap_read_unlock(mm);
> + fput(old_exe_file);
> + if (ret)
> + return ret;
> + }
and should just be removed.
NOTE! I think it makes sense within the context of this patch (where
you just move code around), but that it should then be removed in the
next patch that does that "always deny write access to current MM
exe_file" thing.
I just quoted it in the context of this patch, since the next patch
doesn't actually show this code any more.
In the *old* model - where the ETXTBUSY was about the mmap() of the
file - the above tests make sense.
But in the new model, walking the mappings just doesn't seem to be a
sensible operation any more. The mappings simply aren't what ETXTBUSY
is about in the new world order, and so doing that mapping walk seems
nonsensical.
Hmm?
Linus
On 19.08.21 22:51, Linus Torvalds wrote:
> So I like this series.
>
> However, logically, I think this part in replace_mm_exe_file() no
> longer makes sense:
>
> On Mon, Aug 16, 2021 at 12:50 PM David Hildenbrand <[email protected]> wrote:
>>
>> + /* Forbid mm->exe_file change if old file still mapped. */
>> + old_exe_file = get_mm_exe_file(mm);
>> + if (old_exe_file) {
>> + mmap_read_lock(mm);
>> + for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
>> + if (!vma->vm_file)
>> + continue;
>> + if (path_equal(&vma->vm_file->f_path,
>> + &old_exe_file->f_path))
>> + ret = -EBUSY;
>> + }
>> + mmap_read_unlock(mm);
>> + fput(old_exe_file);
>> + if (ret)
>> + return ret;
>> + }
>
> and should just be removed.
>
> NOTE! I think it makes sense within the context of this patch (where
> you just move code around), but that it should then be removed in the
> next patch that does that "always deny write access to current MM
> exe_file" thing.
>
> I just quoted it in the context of this patch, since the next patch
> doesn't actually show this code any more.
>
> In the *old* model - where the ETXTBUSY was about the mmap() of the
> file - the above tests make sense.
>
> But in the new model, walking the mappings just doesn't seem to be a
> sensible operation any more. The mappings simply aren't what ETXTBUSY
> is about in the new world order, and so doing that mapping walk seems
> nonsensical.
>
> Hmm?
I think this is somewhat another kind of "stop user space trying
to do stupid things" thingy, not necessarily glued to ETXTBUSY:
don't allow replacing exe_file if that very file is still mapped
and consequently eventually still in use by the application.
I don't think it necessarily has many things to do with ETXTBUSY:
we only check if there is a VMA mapping that file, not that it's
a VM_DENYWRITE mapping.
That code originates from
commit 4229fb1dc6843c49a14bb098719f8a696cdc44f8
Author: Konstantin Khlebnikov <[email protected]>
Date: Wed Jul 11 14:02:11 2012 -0700
c/r: prctl: less paranoid prctl_set_mm_exe_file()
"no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid, it forbids operation even if there mapped one shared-anon vma.
Let's check that current mm->exe_file already unmapped, in this case
exe_file symlink already outdated and its changing is reasonable.
The statement "exe_file symlink already outdated and its
changing is reasonable" somewhat makes sense.
Long story short, I think this check somehow makes a bit of sense, but
we wouldn't lose too much if we drop it -- just another sanity check.
Your call :)
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> writes:
> On 19.08.21 22:51, Linus Torvalds wrote:
>> So I like this series.
>>
>> However, logically, I think this part in replace_mm_exe_file() no
>> longer makes sense:
>>
>> On Mon, Aug 16, 2021 at 12:50 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> + /* Forbid mm->exe_file change if old file still mapped. */
>>> + old_exe_file = get_mm_exe_file(mm);
>>> + if (old_exe_file) {
>>> + mmap_read_lock(mm);
>>> + for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
>>> + if (!vma->vm_file)
>>> + continue;
>>> + if (path_equal(&vma->vm_file->f_path,
>>> + &old_exe_file->f_path))
>>> + ret = -EBUSY;
>>> + }
>>> + mmap_read_unlock(mm);
>>> + fput(old_exe_file);
>>> + if (ret)
>>> + return ret;
>>> + }
>>
>> and should just be removed.
>>
>> NOTE! I think it makes sense within the context of this patch (where
>> you just move code around), but that it should then be removed in the
>> next patch that does that "always deny write access to current MM
>> exe_file" thing.
>>
>> I just quoted it in the context of this patch, since the next patch
>> doesn't actually show this code any more.
>>
>> In the *old* model - where the ETXTBUSY was about the mmap() of the
>> file - the above tests make sense.
>>
>> But in the new model, walking the mappings just doesn't seem to be a
>> sensible operation any more. The mappings simply aren't what ETXTBUSY
>> is about in the new world order, and so doing that mapping walk seems
>> nonsensical.
>>
>> Hmm?
>
> I think this is somewhat another kind of "stop user space trying
> to do stupid things" thingy, not necessarily glued to ETXTBUSY:
> don't allow replacing exe_file if that very file is still mapped
> and consequently eventually still in use by the application.
>
> I don't think it necessarily has many things to do with ETXTBUSY:
> we only check if there is a VMA mapping that file, not that it's
> a VM_DENYWRITE mapping.
>
> That code originates from
>
> commit 4229fb1dc6843c49a14bb098719f8a696cdc44f8
> Author: Konstantin Khlebnikov <[email protected]>
> Date: Wed Jul 11 14:02:11 2012 -0700
>
> c/r: prctl: less paranoid prctl_set_mm_exe_file()
>
> "no other files mapped" requirement from my previous patch (c/r: prctl:
> update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
> paranoid, it forbids operation even if there mapped one shared-anon vma.
> Let's check that current mm->exe_file already unmapped, in this case
> exe_file symlink already outdated and its changing is reasonable.
>
>
> The statement "exe_file symlink already outdated and its
> changing is reasonable" somewhat makes sense.
>
>
> Long story short, I think this check somehow makes a bit of sense, but
> we wouldn't lose too much if we drop it -- just another sanity check.
>
> Your call :)
There has been quite a bit of conversation of the years about how bad is
it to allow changing /proc/self/exe as some userspace depends on it.
I think this check is there to keep from changing /proc/self/exe
arbitrarily.
Maybe it is all completely silly and we should not care about the code
that thinks /proc/self/exe is a reliable measure of anything, but short
of that I think we should either keep the code or put in some careful
thought as to which restrictions make sense when changing
/proc/self/exe.
Eric
On Fri, Aug 20, 2021 at 7:36 AM Eric W. Biederman <[email protected]> wrote:
>
> I think this check is there to keep from changing /proc/self/exe
> arbitrarily.
Well, you pretty much can already. You just have to jump through a few hoops.
> Maybe it is all completely silly and we should not care about the code
> that thinks /proc/self/exe is a reliable measure of anything, but short
> of that I think we should either keep the code or put in some careful
> thought as to which restrictions make sense when changing
> /proc/self/exe.
I think the important ones are already there: checking that it is (a)
an executable and (b) that we have execute permission to it.
I also think the code is actually racy - while we are checking "did
the old mm_exe file have any mappings", there's nothing that keeps
another thread from changing the exe file to another one that _does_
have mappings, and then we'll happily replace it with yet another file
because we checked the old one, not the new one it was replaced by in
the meantime.
Of course, that "race" doesn't really matter - exactly because this
isn't about security, it's just a random "let's test that immaterial
thing, and we don't actually care about corner cases".
So I'm not saying that race needs to be fixed - I'm just pointing it
out as an example of how nonsensical the test really is. It's not
fundamental to anything, it's just a random "let's test this odd
condition".
That said, I don't care _that_ much. I'm happy with David's series, I
just think that once we don't do this at a mmap level any more, the
"go look for mappings" code makes little sense.
So we can leave it, and remove it later if people agree.
Linus
On 16.08.21 21:48, David Hildenbrand wrote:
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last renaming user is binfmt binary loading during exec and
> legacy library loading via uselib().
>
So, how do we want to continue with this? Pick it up for v5.15? Have it
in -next for a while and eventually pick it up for v5.16?
I think the "remove ETXTBSY completely" and "remove the sanity mapping
check" thingies should be done on top.
--
Thanks,
David / dhildenb
On Fri, Sep 3, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
>
> So, how do we want to continue with this? Pick it up for v5.15? Have it
> in -next for a while and eventually pick it up for v5.16?
I'm ok with the series. If you have a git tree, do the normal pull
request, and we can do it for 5.15 and see if anybody notices.
As you say, any final removal of ETXTBSY should be a separate and
later patch on top.
Linus
On Mon, Aug 16, 2021 at 09:48:34PM +0200, David Hildenbrand wrote:
> uselib() is the legacy systemcall for loading shared libraries.
> Nowadays, applications use dlopen() to load shared libraries, completely
> implemented in user space via mmap().
>
> For example, glibc uses MAP_COPY to mmap shared libraries. While this
> maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
> MAP_DENYWRITE specification from user space in mmap.
>
> With this change, all remaining in-tree users of MAP_DENYWRITE use it
> to map an executable. We will be able to open shared libraries loaded
> via uselib() writable, just as we already can via dlopen() from user
> space.
>
> This is one step into the direction of removing MAP_DENYWRITE from the
> kernel. This can be considered a minor user space visible change.
>
> Acked-by: "Eric W. Biederman" <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/x86/ia32/ia32_aout.c | 2 +-
> fs/binfmt_aout.c | 2 +-
> fs/binfmt_elf.c | 2 +-
> 3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
> index 5e5b9fc2747f..321d7b22ad2d 100644
> --- a/arch/x86/ia32/ia32_aout.c
> +++ b/arch/x86/ia32/ia32_aout.c
> @@ -293,7 +293,7 @@ static int load_aout_library(struct file *file)
> /* Now use mmap to map the library into memory. */
> error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
> PROT_READ | PROT_WRITE | PROT_EXEC,
> - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_32BIT,
> + MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
> N_TXTOFF(ex));
> retval = error;
> if (error != start_addr)
> diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
> index 145917f734fe..d29de971d3f3 100644
> --- a/fs/binfmt_aout.c
> +++ b/fs/binfmt_aout.c
> @@ -309,7 +309,7 @@ static int load_aout_library(struct file *file)
> /* Now use mmap to map the library into memory. */
> error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
> PROT_READ | PROT_WRITE | PROT_EXEC,
> - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> + MAP_FIXED | MAP_PRIVATE;
> N_TXTOFF(ex));
Guess someone didn't care compile testing their code. This is now in
mainline.
Guenter
On Sun, Sep 5, 2021 at 8:32 AM Guenter Roeck <[email protected]> wrote:
>
> Guess someone didn't care compile testing their code. This is now in
> mainline.
To be fair, a.out is disabled pretty much on all relevant platforms these days.
Only alpha and m68k left, I think.
I applied the obvious patch from Geert.
Linus
On 05.09.21 19:17, Linus Torvalds wrote:
> On Sun, Sep 5, 2021 at 8:32 AM Guenter Roeck <[email protected]> wrote:
>>
>> Guess someone didn't care compile testing their code. This is now in
>> mainline.
>
> To be fair, a.out is disabled pretty much on all relevant platforms these days.
Yes, and it seems like it was disabled in all configs I used. (I did not
compile all-yes configs; usually my stuff goes via -mm where it will end
up in -next for a while ... this one was special)
>
> Only alpha and m68k left, I think.
>
> I applied the obvious patch from Geert.
Thanks Linus!
--
Thanks,
David / dhildenb