Hello,
This is a resend[1] of a patch series that implements safe writes to
/proc/pid/mem. Such functionality is useful as it gives debuggers a simple and
efficient mechanism to manipulate a process' address space. Memory can be read
and written using single calls to pread(2) and pwrite(2) instead of iteratively
calling into ptrace(2).
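For illustration, here is a minimal debugger-side sketch of the usage this
series enables. It is not part of the series; it assumes a valid pid and
address on the command line, assumes sufficient privileges, and omits error
handling:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		pid_t pid;
		unsigned long addr;
		char path[64], buf[4096];
		ssize_t n;
		int fd;

		if (argc < 3)
			return 1;
		pid = atoi(argv[1]);
		addr = strtoul(argv[2], NULL, 0);

		/* The permission model still requires an active ptrace attach. */
		ptrace(PTRACE_ATTACH, pid, NULL, NULL);
		waitpid(pid, NULL, 0);

		snprintf(path, sizeof(path), "/proc/%d/mem", pid);
		fd = open(path, O_RDWR);

		/* One pread() replaces a loop of PTRACE_PEEKDATA calls... */
		n = pread(fd, buf, sizeof(buf), addr);

		/* ...and one pwrite() replaces a loop of PTRACE_POKEDATA calls. */
		if (n > 0)
			pwrite(fd, buf, n, addr);

		close(fd);
		ptrace(PTRACE_DETACH, pid, NULL, NULL);
		return 0;
	}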
Since the first version, this series has received some good review. However, I
think the first half of the series (patches 1-5) would benefit from an ACK from
one of the x86 maintainers before the mm side of things can move forward.
All changes with respect to x86 are in patches 1-5. These make get_gate_vma()
and in_gate_area() functions of the mm_struct, not the task_struct. This is
accomplished by adding a field to mm_context_t (mm->context) that mirrors
TIF_IA32. This change will help simplify current and future operations on
mm's. For example, it allows some code paths to avoid holding task_lock, or to
simply avoid passing a task_struct around when an mm will do.
Patches 6-12 build on this flexibility to enable secure writes to
/proc/pid/mem. These patches impact the memory and procfs subsystems. The
principal strategy is to take a reference to the target task's mm before the
permission check, and to hold that reference until after the write completes.
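In outline, the resulting flow in mem_read()/mem_write() looks like the
following sketch (see the procfs patches later in the series for the real
code):

	mm = check_mem_permission(task);  /* takes cred_guard_mutex; mm or ERR_PTR */
	if (IS_ERR(mm))
		goto out;
	...
	retval = access_remote_vm(mm, addr, page, this_len, write);
	...
	mmput(mm);  /* drop the reference only after the access completes */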
This patch set is based on v2.6.38.
The general approach used was suggested to me by Alexander Viro, but any
mistakes present in these patches are entirely my own.
Thanks!
--
steve
[1] lkml.org/lkml/2011/3/13/147
Changes since v1:
- Rename mm_context_t.compat to ia32_compat as suggested by Michel
Lespinasse.
- Rework check_mem_permission() to return ERR_PTR and hold cred_guard_mutex
as suggested by Alexander Viro.
- Collapse patches into a single series.
Stephen Wilson (12):
  x86: add context tag to mark mm when running a task in 32-bit compatibility mode
  x86: mark associated mm when running a task in 32-bit compatibility mode
  mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
  mm: arch: make in_gate_area take an mm_struct instead of a task_struct
  mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
  mm: use mm_struct to resolve gate vma's in __get_user_pages
  mm: factor out main logic of access_process_vm
  mm: implement access_remote_vm
  proc: disable mem_write after exec
  proc: hold cred_guard_mutex in check_mem_permission()
  proc: make check_mem_permission() return an mm_struct on success
  proc: enable writing to /proc/pid/mem
 arch/powerpc/kernel/vdso.c         |    6 +-
 arch/s390/kernel/vdso.c            |    6 +-
 arch/sh/kernel/vsyscall/vsyscall.c |    6 +-
 arch/x86/ia32/ia32_aout.c          |    1 +
 arch/x86/include/asm/mmu.h         |    6 +++
 arch/x86/kernel/process_64.c       |    8 ++++
 arch/x86/mm/init_64.c              |   16 ++++----
 arch/x86/vdso/vdso32-setup.c       |   15 ++++---
 fs/binfmt_elf.c                    |    2 +-
 fs/proc/base.c                     |   79 ++++++++++++++++++++++++------------
 fs/proc/task_mmu.c                 |    8 ++-
 include/linux/mm.h                 |   12 +++--
 kernel/kallsyms.c                  |    4 +-
 mm/memory.c                        |   73 ++++++++++++++++++++++++---------
 mm/mlock.c                         |    4 +-
 mm/nommu.c                         |    2 +-
 16 files changed, 165 insertions(+), 83 deletions(-)
This patch simply follows the same practice used for setting the TIF_IA32
flag. In particular, an mm is marked as holding 32-bit tasks when a 32-bit
binary is exec'ed. Both the ELF and a.out formats are updated.
Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
 arch/x86/ia32/ia32_aout.c    |    1 +
 arch/x86/kernel/process_64.c |    8 ++++++++
 2 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 2d93bdb..fd84387 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -298,6 +298,7 @@ static int load_aout_binary(struct linux_binprm *bprm, struct pt_regs *regs)
/* OK, This is the point of no return */
set_personality(PER_LINUX);
set_thread_flag(TIF_IA32);
+ current->mm->context.ia32_compat = 1;
setup_new_exec(bprm);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bd387e8..6c9dd92 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -501,6 +501,10 @@ void set_personality_64bit(void)
/* Make sure to be in 64bit mode */
clear_thread_flag(TIF_IA32);
+ /* Ensure the corresponding mm is not marked. */
+ if (current->mm)
+ current->mm->context.ia32_compat = 0;
+
/* TBD: overwrites user setup. Should have two bits.
But 64bit processes have always behaved this way,
so it's not too bad. The main problem is just that
@@ -516,6 +520,10 @@ void set_personality_ia32(void)
set_thread_flag(TIF_IA32);
current->personality |= force_personality32;
+ /* Mark the associated mm as containing 32-bit tasks. */
+ if (current->mm)
+ current->mm->context.ia32_compat = 1;
+
/* Prepare the first "return" to user space */
current_thread_info()->status |= TS_COMPAT;
}
--
1.7.3.5
This tag is intended to mirror the thread info flag TIF_IA32. It will be used
to identify mm's that support 32-bit tasks running in compatibility mode
without requiring a reference to the task itself.
Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/include/asm/mmu.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 80a1dee..aeff3e8 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -13,6 +13,12 @@ typedef struct {
int size;
struct mutex lock;
void *vdso;
+
+#ifdef CONFIG_X86_64
+ /* True if mm supports a task running in 32 bit compatibility mode. */
+ unsigned short ia32_compat;
+#endif
+
} mm_context_t;
#ifdef CONFIG_SMP
--
1.7.3.5
Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task. Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.
Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
 arch/powerpc/kernel/vdso.c         |    2 +-
 arch/s390/kernel/vdso.c            |    2 +-
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/mm/init_64.c              |    4 ++--
 arch/x86/vdso/vdso32-setup.c       |    4 ++--
 include/linux/mm.h                 |    4 ++--
 mm/memory.c                        |    2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 6169f17..467aa9e 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -825,7 +825,7 @@ int in_gate_area_no_task(unsigned long addr)
return 0;
}
-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
return 0;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index d19f305..9006e96 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -342,7 +342,7 @@ int in_gate_area_no_task(unsigned long addr)
return 0;
}
-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
return 0;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 3f9b6f4..62c36a8 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -99,7 +99,7 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return NULL;
}
-int in_gate_area(struct task_struct *task, unsigned long address)
+int in_gate_area(struct mm_struct *mm, unsigned long address)
{
return 0;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fa56f9b..4deb881 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -879,9 +879,9 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return &gate_vma;
}
-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
- struct vm_area_struct *vma = get_gate_vma(task->mm);
+ struct vm_area_struct *vma = get_gate_vma(mm);
if (!vma)
return 0;
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 1f651f6..f849bb2 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -428,9 +428,9 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return NULL;
}
-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
- const struct vm_area_struct *vma = get_gate_vma(task->mm);
+ const struct vm_area_struct *vma = get_gate_vma(mm);
return vma && addr >= vma->vm_start && addr < vma->vm_end;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b571921..c700bbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1571,10 +1571,10 @@ static inline bool kernel_page_present(struct page *page) { return true; }
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
int in_gate_area_no_task(unsigned long addr);
-int in_gate_area(struct task_struct *task, unsigned long addr);
+int in_gate_area(struct mm_struct *mm, unsigned long addr);
#else
int in_gate_area_no_task(unsigned long addr);
-#define in_gate_area(task, addr) ({(void)task; in_gate_area_no_task(addr);})
+#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_task(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */
int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/memory.c b/mm/memory.c
index aec7cbd..d1347ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,7 +1437,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;
vma = find_extend_vma(mm, start);
- if (!vma && in_gate_area(tsk, start)) {
+ if (!vma && in_gate_area(tsk->mm, start)) {
unsigned long pg = start & PAGE_MASK;
struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
pgd_t *pgd;
--
1.7.3.5
Now that gate vma's are referenced with respect to a particular mm and not a
particular task, it only makes sense to propagate the change to this predicate
as well.
Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
 arch/powerpc/kernel/vdso.c         |    2 +-
 arch/s390/kernel/vdso.c            |    2 +-
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/mm/init_64.c              |    8 ++++----
 arch/x86/vdso/vdso32-setup.c       |    2 +-
 include/linux/mm.h                 |    6 +++---
 kernel/kallsyms.c                  |    4 ++--
 mm/memory.c                        |    2 +-
 mm/nommu.c                         |    2 +-
 9 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 467aa9e..142ab10 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -820,7 +820,7 @@ static int __init vdso_init(void)
}
arch_initcall(vdso_init);
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 9006e96..d73630b 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -337,7 +337,7 @@ static int __init vdso_init(void)
}
arch_initcall(vdso_init);
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 62c36a8..1d6d51a 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -104,7 +104,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long address)
return 0;
}
-int in_gate_area_no_task(unsigned long address)
+int in_gate_area_no_mm(unsigned long address)
{
return 0;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4deb881..560113c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -890,11 +890,11 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr)
}
/*
- * Use this when you have no reliable task/vma, typically from interrupt
- * context. It is less reliable than using the task's vma and may give
- * false positives:
+ * Use this when you have no reliable mm, typically from interrupt
+ * context. It is less reliable than using a task's mm and may give
+ * false positives.
*/
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return (addr >= VSYSCALL_START) && (addr < VSYSCALL_END);
}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index f849bb2..468d591 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -435,7 +435,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr)
return vma && addr >= vma->vm_start && addr < vma->vm_end;
}
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c700bbb..694512d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1570,11 +1570,11 @@ static inline bool kernel_page_present(struct page *page) { return true; }
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
-int in_gate_area_no_task(unsigned long addr);
+int in_gate_area_no_mm(unsigned long addr);
int in_gate_area(struct mm_struct *mm, unsigned long addr);
#else
-int in_gate_area_no_task(unsigned long addr);
-#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_task(addr);})
+int in_gate_area_no_mm(unsigned long addr);
+#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_mm(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */
int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 6f6d091..b9d0fd1 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -64,14 +64,14 @@ static inline int is_kernel_text(unsigned long addr)
if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) ||
arch_is_kernel_text(addr))
return 1;
- return in_gate_area_no_task(addr);
+ return in_gate_area_no_mm(addr);
}
static inline int is_kernel(unsigned long addr)
{
if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end)
return 1;
- return in_gate_area_no_task(addr);
+ return in_gate_area_no_mm(addr);
}
static int is_ksym_addr(unsigned long addr)
diff --git a/mm/memory.c b/mm/memory.c
index d1347ac..3863e86 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3448,7 +3448,7 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
#endif
}
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
#ifdef AT_SYSINFO_EHDR
if ((addr >= FIXADDR_USER_START) && (addr < FIXADDR_USER_END))
diff --git a/mm/nommu.c b/mm/nommu.c
index f59e142..e629143 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1963,7 +1963,7 @@ error:
return -ENOMEM;
}
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
--
1.7.3.5
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task. Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.
Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
 arch/powerpc/kernel/vdso.c         |    2 +-
 arch/s390/kernel/vdso.c            |    2 +-
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/mm/init_64.c              |    6 +++---
 arch/x86/vdso/vdso32-setup.c       |   11 ++++++-----
 fs/binfmt_elf.c                    |    2 +-
 fs/proc/task_mmu.c                 |    8 +++++---
 include/linux/mm.h                 |    2 +-
 mm/memory.c                        |    4 ++--
 mm/mlock.c                         |    4 ++--
 10 files changed, 23 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index fd87287..6169f17 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -830,7 +830,7 @@ int in_gate_area(struct task_struct *task, unsigned long addr)
return 0;
}
-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index f438d74..d19f305 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -347,7 +347,7 @@ int in_gate_area(struct task_struct *task, unsigned long addr)
return 0;
}
-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 242117c..3f9b6f4 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -94,7 +94,7 @@ const char *arch_vma_name(struct vm_area_struct *vma)
return NULL;
}
-struct vm_area_struct *get_gate_vma(struct task_struct *task)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index c14a542..fa56f9b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -870,10 +870,10 @@ static struct vm_area_struct gate_vma = {
.vm_flags = VM_READ | VM_EXEC
};
-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(tsk, TIF_IA32))
+ if (!mm || mm->context.ia32_compat)
return NULL;
#endif
return &gate_vma;
@@ -881,7 +881,7 @@ struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
int in_gate_area(struct task_struct *task, unsigned long addr)
{
- struct vm_area_struct *vma = get_gate_vma(task);
+ struct vm_area_struct *vma = get_gate_vma(task->mm);
if (!vma)
return 0;
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 36df991..1f651f6 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -417,11 +417,12 @@ const char *arch_vma_name(struct vm_area_struct *vma)
return NULL;
}
-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
- struct mm_struct *mm = tsk->mm;
-
- /* Check to see if this task was created in compat vdso mode */
+ /*
+ * Check to see if the corresponding task was created in compat vdso
+ * mode.
+ */
if (mm && mm->context.vdso == (void *)VDSO_HIGH_BASE)
return &gate_vma;
return NULL;
@@ -429,7 +430,7 @@ struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
int in_gate_area(struct task_struct *task, unsigned long addr)
{
- const struct vm_area_struct *vma = get_gate_vma(task);
+ const struct vm_area_struct *vma = get_gate_vma(task->mm);
return vma && addr >= vma->vm_start && addr < vma->vm_end;
}
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index d5b640b..bbabdcc 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1906,7 +1906,7 @@ static int elf_core_dump(struct coredump_params *cprm)
segs = current->mm->map_count;
segs += elf_core_extra_phdrs();
- gate_vma = get_gate_vma(current);
+ gate_vma = get_gate_vma(current->mm);
if (gate_vma != NULL)
segs++;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 60b9148..bb548d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -126,7 +126,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
down_read(&mm->mmap_sem);
- tail_vma = get_gate_vma(priv->task);
+ tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;
/* Start with last addr hint */
@@ -277,7 +277,8 @@ static int show_map(struct seq_file *m, void *v)
show_map_vma(m, vma);
if (m->count < m->size) /* vma is copied successfully */
- m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
+ m->version = (vma != get_gate_vma(task->mm))
+ ? vma->vm_start : 0;
return 0;
}
@@ -436,7 +437,8 @@ static int show_smap(struct seq_file *m, void *v)
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
if (m->count < m->size) /* vma is copied successfully */
- m->version = (vma != get_gate_vma(task)) ? vma->vm_start : 0;
+ m->version = (vma != get_gate_vma(task->mm))
+ ? vma->vm_start : 0;
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6385fc..b571921 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1568,7 +1568,7 @@ static inline bool kernel_page_present(struct page *page) { return true; }
#endif /* CONFIG_HIBERNATION */
#endif
-extern struct vm_area_struct *get_gate_vma(struct task_struct *tsk);
+extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
int in_gate_area_no_task(unsigned long addr);
int in_gate_area(struct task_struct *task, unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index 5823698..aec7cbd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1439,7 +1439,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(tsk, start)) {
unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk);
+ struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3439,7 +3439,7 @@ static int __init gate_vma_init(void)
__initcall(gate_vma_init);
#endif
-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
#ifdef AT_SYSINFO_EHDR
return &gate_vma;
diff --git a/mm/mlock.c b/mm/mlock.c
index c3924c7f..2689a08c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -237,7 +237,7 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current))) {
+ vma == get_gate_vma(current->mm))) {
__mlock_vma_pages_range(vma, start, end, NULL);
@@ -332,7 +332,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
int lock = newflags & VM_LOCKED;
if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
- is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
goto out; /* don't set VM_LOCKED, don't count */
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
--
1.7.3.5
We now check whether a requested user page overlaps a gate vma using the
supplied mm instead of the supplied task. The task is used solely for page
fault accounting and may be NULL.
Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 18 +++++++++++-------
1 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3863e86..36445e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,9 +1437,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;
vma = find_extend_vma(mm, start);
- if (!vma && in_gate_area(tsk->mm, start)) {
+ if (!vma && in_gate_area(mm, start)) {
unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
+ struct vm_area_struct *gate_vma = get_gate_vma(mm);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -1533,10 +1533,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
return i ? i : -EFAULT;
BUG();
}
- if (ret & VM_FAULT_MAJOR)
- tsk->maj_flt++;
- else
- tsk->min_flt++;
+
+ if (tsk) {
+ if (ret & VM_FAULT_MAJOR)
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }
if (ret & VM_FAULT_RETRY) {
*nonblocking = 0;
@@ -1581,7 +1584,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
/**
* get_user_pages() - pin user pages in memory
- * @tsk: task_struct of target task
+ * @tsk: the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
--
1.7.3.5
This change makes mem_write() observe the same constraints as mem_read(). This
is particularly important for mem_write(), as an accidental leak of the fd
across an exec could result in arbitrary modification of the target process'
memory. IOW, /proc/pid/mem is implicitly close-on-exec.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d49c4b5..13c5e8c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -848,6 +848,10 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
if (check_mem_permission(task))
goto out;
+ copied = -EIO;
+ if (file->private_data != (void *)((long)current->self_exec_id))
+ goto out;
+
copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
--
1.7.3.5
This change allows us to take advantage of access_remote_vm(), which in turn
eliminates a security issue with the mem_write() implementation.
The previous implementation of mem_write() was insecure since the target task
could exec a setuid-root binary between the permission check and the actual
write. Holding a reference to the target mm_struct eliminates this
vulnerability.
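Concretely, the window looked like this (illustrative sketch; names as in the
code below):

	debugger                            target task
	--------                            -----------
	check_mem_permission(task)
	    /* passes against old creds */
	                                    execve() a setuid-root binary
	access_process_vm(task, ...)
	    /* writes into the new, privileged mm */

Holding the mm reference taken under cred_guard_mutex closes this window: the
write lands in the address space that was actually validated, while the newly
exec'd image runs on a fresh mm.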
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 58 ++++++++++++++++++++++++++++++++-----------------------
1 files changed, 34 insertions(+), 24 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 084d046..ebe3c47 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -191,14 +191,20 @@ static int proc_root_link(struct inode *inode, struct path *path)
return result;
}
-static int __check_mem_permission(struct task_struct *task)
+static struct mm_struct *__check_mem_permission(struct task_struct *task)
{
+ struct mm_struct *mm;
+
+ mm = get_task_mm(task);
+ if (!mm)
+ return ERR_PTR(-EINVAL);
+
/*
* A task can always look at itself, in case it chooses
* to use system calls instead of load instructions.
*/
if (task == current)
- return 0;
+ return mm;
/*
* If current is actively ptrace'ing, and would also be
@@ -210,20 +216,23 @@ static int __check_mem_permission(struct task_struct *task)
match = (tracehook_tracer_task(task) == current);
rcu_read_unlock();
if (match && ptrace_may_access(task, PTRACE_MODE_ATTACH))
- return 0;
+ return mm;
}
/*
* Noone else is allowed.
*/
- return -EPERM;
+ mmput(mm);
+ return ERR_PTR(-EPERM);
}
/*
- * Return zero if current may access user memory in @task, -error if not.
+ * If current may access user memory in @task return a reference to the
+ * corresponding mm, otherwise ERR_PTR.
*/
-static int check_mem_permission(struct task_struct *task)
+static struct mm_struct *check_mem_permission(struct task_struct *task)
{
+ struct mm_struct *mm;
int err;
/*
@@ -232,12 +241,12 @@ static int check_mem_permission(struct task_struct *task)
*/
err = mutex_lock_killable(&task->signal->cred_guard_mutex);
if (err)
- return err;
+ return ERR_PTR(err);
- err = __check_mem_permission(task);
+ mm = __check_mem_permission(task);
mutex_unlock(&task->signal->cred_guard_mutex);
- return err;
+ return mm;
}
struct mm_struct *mm_for_maps(struct task_struct *task)
@@ -793,18 +802,14 @@ static ssize_t mem_read(struct file * file, char __user * buf,
if (!task)
goto out_no_task;
- if (check_mem_permission(task))
- goto out;
-
ret = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
goto out;
- ret = 0;
-
- mm = get_task_mm(task);
- if (!mm)
+ mm = check_mem_permission(task);
+ ret = PTR_ERR(mm);
+ if (IS_ERR(mm))
goto out_free;
ret = -EIO;
@@ -818,8 +823,8 @@ static ssize_t mem_read(struct file * file, char __user * buf,
int this_len, retval;
this_len = (count > PAGE_SIZE) ? PAGE_SIZE : count;
- retval = access_process_vm(task, src, page, this_len, 0);
- if (!retval || check_mem_permission(task)) {
+ retval = access_remote_vm(mm, src, page, this_len, 0);
+ if (!retval) {
if (!ret)
ret = -EIO;
break;
@@ -858,22 +863,25 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
char *page;
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
unsigned long dst = *ppos;
+ struct mm_struct *mm;
copied = -ESRCH;
if (!task)
goto out_no_task;
- if (check_mem_permission(task))
- goto out;
+ mm = check_mem_permission(task);
+ copied = PTR_ERR(mm);
+ if (IS_ERR(mm))
+ goto out_task;
copied = -EIO;
if (file->private_data != (void *)((long)current->self_exec_id))
- goto out;
+ goto out_mm;
copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
- goto out;
+ goto out_mm;
copied = 0;
while (count > 0) {
@@ -884,7 +892,7 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
copied = -EFAULT;
break;
}
- retval = access_process_vm(task, dst, page, this_len, 1);
+ retval = access_remote_vm(mm, dst, page, this_len, 1);
if (!retval) {
if (!copied)
copied = -EIO;
@@ -897,7 +905,9 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
}
*ppos = dst;
free_page((unsigned long) page);
-out:
+out_mm:
+ mmput(mm);
+out_task:
put_task_struct(task);
out_no_task:
return copied;
--
1.7.3.5
With recent changes there is no longer a security hazard with writing to
/proc/pid/mem. Remove the mem_write stub and its #ifndef guard.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebe3c47..a0cfcd4 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -852,10 +852,6 @@ out_no_task:
return ret;
}
-#define mem_write NULL
-
-#ifndef mem_write
-/* This is a security hazard */
static ssize_t mem_write(struct file * file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -912,7 +908,6 @@ out_task:
out_no_task:
return copied;
}
-#endif
loff_t mem_lseek(struct file *file, loff_t offset, int orig)
{
--
1.7.3.5
Avoid a potential race where the task exec's and we end up with a new ->mm but
check against the old credentials in ptrace_may_access().
Holding the mutex is implemented by factoring the body of the check out into a
helper function, __check_mem_permission(). Performing this factorization now
simplifies upcoming changes and minimizes churn in the diffs.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 26 ++++++++++++++++++++++----
1 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 13c5e8c..084d046 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -191,10 +191,7 @@ static int proc_root_link(struct inode *inode, struct path *path)
return result;
}
-/*
- * Return zero if current may access user memory in @task, -error if not.
- */
-static int check_mem_permission(struct task_struct *task)
+static int __check_mem_permission(struct task_struct *task)
{
/*
* A task can always look at itself, in case it chooses
@@ -222,6 +219,27 @@ static int check_mem_permission(struct task_struct *task)
return -EPERM;
}
+/*
+ * Return zero if current may access user memory in @task, -error if not.
+ */
+static int check_mem_permission(struct task_struct *task)
+{
+ int err;
+
+ /*
+ * Avoid racing if task exec's as we might get a new mm but validate
+ * against old credentials.
+ */
+ err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ if (err)
+ return err;
+
+ err = __check_mem_permission(task);
+ mutex_unlock(&task->signal->cred_guard_mutex);
+
+ return err;
+}
+
struct mm_struct *mm_for_maps(struct task_struct *task)
{
struct mm_struct *mm;
--
1.7.3.5
On Wed, Mar 23, 2011 at 10:43:49AM -0400, Stephen Wilson wrote:
> Hello,
>
> This is a resend[1] of a patch series that implements safe writes to
> /proc/pid/mem. Such functionality is useful as it gives debuggers a simple and
> efficient mechanism to manipulate a process' address space. Memory can be read
> and written using single calls to pread(2) and pwrite(2) instead of iteratively
> calling into ptrace(2).
It's in local queue already, along with other procfs fixes; I'm going to push
it today, with several more procfs race fixes added.
On Wed, Mar 23, 2011 at 02:57:49PM +0000, Al Viro wrote:
> On Wed, Mar 23, 2011 at 10:43:49AM -0400, Stephen Wilson wrote:
> > Hello,
> >
> > This is a resend[1] of a patch series that implements safe writes to
> > /proc/pid/mem. Such functionality is useful as it gives debuggers a simple and
> > efficient mechanism to manipulate a process' address space. Memory can be read
> > and written using single calls to pread(2) and pwrite(2) instead of iteratively
> > calling into ptrace(2).
>
> It's in local queue already, along with other procfs fixes; I'm going to push
> it today, with several more procfs race fixes added.
Ah, OK. I had no idea. Thanks so much!
--
steve
Introduce an internal helper, __access_remote_vm(), and base access_process_vm()
on top of it. The helper may be called with a NULL task_struct if page fault
accounting is not desired. This code will be shared with a new address space
accessor that is independent of task_struct.
Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 35 +++++++++++++++++++++++++----------
1 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 36445e3..68eec4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3593,20 +3593,15 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
#endif
/*
- * Access another process' address space.
- * Source/target buffer must be kernel space,
- * Do not walk the page table directly, use get_user_pages
+ * Access another process' address space as given in mm. If non-NULL, use the
+ * given task for page fault accounting.
*/
-int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
+static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, int write)
{
- struct mm_struct *mm;
struct vm_area_struct *vma;
void *old_buf = buf;
- mm = get_task_mm(tsk);
- if (!mm)
- return 0;
-
down_read(&mm->mmap_sem);
/* ignore errors, just check how much was successfully transferred */
while (len) {
@@ -3655,12 +3650,32 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
addr += bytes;
}
up_read(&mm->mmap_sem);
- mmput(mm);
return buf - old_buf;
}
/*
+ * Access another process' address space.
+ * Source/target buffer must be kernel space,
+ * Do not walk the page table directly, use get_user_pages
+ */
+int access_process_vm(struct task_struct *tsk, unsigned long addr,
+ void *buf, int len, int write)
+{
+ struct mm_struct *mm;
+ int ret;
+
+ mm = get_task_mm(tsk);
+ if (!mm)
+ return 0;
+
+ ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
+ mmput(mm);
+
+ return ret;
+}
+
+/*
* Print the name of a VMA.
*/
void print_vma_addr(char *prefix, unsigned long ip)
--
1.7.3.5
Provide an alternative to access_process_vm() that takes the target mm_struct
directly. The caller must obtain and hold a reference on the supplied mm.
Signed-off-by: Stephen Wilson <[email protected]>
---
 include/linux/mm.h |    2 ++
 mm/memory.c        |   16 ++++++++++++++++
 2 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 694512d..e5fde8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -964,6 +964,8 @@ static inline int handle_mm_fault(struct mm_struct *mm,
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
+extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write);
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
diff --git a/mm/memory.c b/mm/memory.c
index 68eec4f..c26e4f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3654,6 +3654,22 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
return buf - old_buf;
}
+/**
+ * access_remote_vm - access another process' address space
+ * @mm: the mm_struct of the target address space
+ * @addr: start address to access
+ * @buf: source or destination buffer
+ * @len: number of bytes to transfer
+ * @write: whether the access is a write
+ *
+ * The caller must hold a reference on @mm.
+ */
+int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write)
+{
+ return __access_remote_vm(NULL, mm, addr, buf, len, write);
+}
+
/*
* Access another process' address space.
* Source/target buffer must be kernel space,
--
1.7.3.5