2011-03-13 19:50:45

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH v2 0/12] enable writing to /proc/pid/mem

For a long time /proc/pid/mem has provided a read-only interface, at least
since 2.4.0. However, a write capability has existed "forever" in tree via the
function mem_write(), disabled with an #ifdef along with the comment "this is a
security hazard". Currently, the main problem with mem_write() is that between
the time permissions are checked and the actual write the target task could
exec a setuid-root binary.

This patch series enables safe writes to /proc/pid/mem. Such functionality is
useful as it gives debuggers a simple and efficient mechanism to manipulate a
process' address space. Memory can be read and written using single calls to
pread(2) and pwrite(2) instead of iteratively calling into ptrace(2). In
addition, /proc/pid/mem has always had write permissions enabled, so clearly it
*wants* to be written to.


The first version of these patches was split into two series. Here they are
combined together for easier reference and review.


Patches 1-5 make is_gate_vma() and in_gate_vma() functions of mm_struct, not
task_struct. These patches are of particular interest to the x86 architecture
maintainers and were originally distributed as a stand alone series[1]. From a
conceptual point of view, the question of whether an address lies in a gate vma
should be asked with respect to a particular mm, not a particular task. From a
practical point of view, this change will help simplify current and future
operations on mm's. In particular, it allows some code paths to avoid the need
to hold task_lock. The principle change there is to mirror TIF_IA32 via a new
flag in mm_context_t.


Patches 6-12 build on the new flexibility to enable secure writes to
/proc/pid/mem. These patches impact the memory and procfs subsystems and were
originally distributed as a stand alone series[2]. The principle strategy is
to get a reference to the target task's mm before the permission check, and to
hold that reference until after the write completes.

This patch set is based on v2.6.38-rc8.

The general approach used was suggested to me by Alexander Viro, but any
mistakes present in these patches are entirely my own.

--
steve


[1] lkml.org/lkml/2011/3/8/409
[2] lkml.org/lkml/2011/3/8/418


Changes since v1:

- Rename mm_context_t.compat to ia32_compat as suggested by Michel
Lespinasse.

- Rework check_mem_permission() to return ERR_PTR and hold cred_guard_mutex
as suggested by Alexander Viro.

- Collapse patches into a single series.

Stephen Wilson (12):
x86: add context tag to mark mm when running a task in 32-bit compatibility mode
x86: mark associated mm when running a task in 32 bit compatibility mode
mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
mm: arch: make in_gate_area take an mm_struct instead of a task_struct
mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
mm: use mm_struct to resolve gate vma's in __get_user_pages
mm: factor out main logic of access_process_vm
mm: implement access_remote_vm
proc: disable mem_write after exec
proc: hold cred_guard_mutex in check_mem_permission()
proc: make check_mem_permission() return an mm_struct on success
proc: enable writing to /proc/pid/mem


arch/powerpc/kernel/vdso.c | 6 +-
arch/s390/kernel/vdso.c | 6 +-
arch/sh/kernel/vsyscall/vsyscall.c | 6 +-
arch/x86/ia32/ia32_aout.c | 1 +
arch/x86/include/asm/mmu.h | 6 +++
arch/x86/kernel/process_64.c | 8 ++++
arch/x86/mm/init_64.c | 16 ++++----
arch/x86/vdso/vdso32-setup.c | 15 ++++---
fs/binfmt_elf.c | 2 +-
fs/proc/base.c | 79 ++++++++++++++++++++++++------------
fs/proc/task_mmu.c | 8 ++-
include/linux/mm.h | 12 +++--
kernel/kallsyms.c | 4 +-
mm/memory.c | 73 ++++++++++++++++++++++++---------
mm/mlock.c | 4 +-
mm/nommu.c | 2 +-
16 files changed, 165 insertions(+), 83 deletions(-)



2011-03-13 19:51:04

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 02/12] x86: mark associated mm when running a task in 32 bit compatibility mode

This patch simply follows the same practice as for setting the TIF_IA32 flag.
In particular, an mm is marked as holding 32-bit tasks when a 32-bit binary is
exec'ed. Both ELF and a.out formats are updated.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/ia32/ia32_aout.c | 1 +
arch/x86/kernel/process_64.c | 8 ++++++++
2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 2d93bdb..fd84387 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -298,6 +298,7 @@ static int load_aout_binary(struct linux_binprm *bprm, struct pt_regs *regs)
/* OK, This is the point of no return */
set_personality(PER_LINUX);
set_thread_flag(TIF_IA32);
+ current->mm->context.ia32_compat = 1;

setup_new_exec(bprm);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bd387e8..6c9dd92 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -501,6 +501,10 @@ void set_personality_64bit(void)
/* Make sure to be in 64bit mode */
clear_thread_flag(TIF_IA32);

+ /* Ensure the corresponding mm is not marked. */
+ if (current->mm)
+ current->mm->context.ia32_compat = 0;
+
/* TBD: overwrites user setup. Should have two bits.
But 64bit processes have always behaved this way,
so it's not too bad. The main problem is just that
@@ -516,6 +520,10 @@ void set_personality_ia32(void)
set_thread_flag(TIF_IA32);
current->personality |= force_personality32;

+ /* Mark the associated mm as containing 32-bit tasks. */
+ if (current->mm)
+ current->mm->context.ia32_compat = 1;
+
/* Prepare the first "return" to user space */
current_thread_info()->status |= TS_COMPAT;
}
--
1.7.3.5

2011-03-13 19:51:12

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 01/12] x86: add context tag to mark mm when running a task in 32-bit compatibility mode

This tag is intended to mirror the thread info TIF_IA32 flag. Will be used to
identify mm's which support 32 bit tasks running in compatibility mode without
requiring a reference to the task itself.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/include/asm/mmu.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 80a1dee..aeff3e8 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -13,6 +13,12 @@ typedef struct {
int size;
struct mutex lock;
void *vdso;
+
+#ifdef CONFIG_X86_64
+ /* True if mm supports a task running in 32 bit compatibility mode. */
+ unsigned short ia32_compat;
+#endif
+
} mm_context_t;

#ifdef CONFIG_SMP
--
1.7.3.5

2011-03-13 19:51:50

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 03/12] mm: arch: make get_gate_vma take an mm_struct instead of a task_struct

Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task. Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/powerpc/kernel/vdso.c | 2 +-
arch/s390/kernel/vdso.c | 2 +-
arch/sh/kernel/vsyscall/vsyscall.c | 2 +-
arch/x86/mm/init_64.c | 6 +++---
arch/x86/vdso/vdso32-setup.c | 11 ++++++-----
fs/binfmt_elf.c | 2 +-
fs/proc/task_mmu.c | 8 +++++---
include/linux/mm.h | 2 +-
mm/memory.c | 4 ++--
mm/mlock.c | 4 ++--
10 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index fd87287..6169f17 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -830,7 +830,7 @@ int in_gate_area(struct task_struct *task, unsigned long addr)
return 0;
}

-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index f438d74..d19f305 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -347,7 +347,7 @@ int in_gate_area(struct task_struct *task, unsigned long addr)
return 0;
}

-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 242117c..3f9b6f4 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -94,7 +94,7 @@ const char *arch_vma_name(struct vm_area_struct *vma)
return NULL;
}

-struct vm_area_struct *get_gate_vma(struct task_struct *task)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
return NULL;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..711b690 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -870,10 +870,10 @@ static struct vm_area_struct gate_vma = {
.vm_flags = VM_READ | VM_EXEC
};

-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(tsk, TIF_IA32))
+ if (!mm || mm->context.ia32_compat)
return NULL;
#endif
return &gate_vma;
@@ -881,7 +881,7 @@ struct vm_area_struct *get_gate_vma(struct task_struct *tsk)

int in_gate_area(struct task_struct *task, unsigned long addr)
{
- struct vm_area_struct *vma = get_gate_vma(task);
+ struct vm_area_struct *vma = get_gate_vma(task->mm);

if (!vma)
return 0;
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 36df991..1f651f6 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -417,11 +417,12 @@ const char *arch_vma_name(struct vm_area_struct *vma)
return NULL;
}

-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
- struct mm_struct *mm = tsk->mm;
-
- /* Check to see if this task was created in compat vdso mode */
+ /*
+ * Check to see if the corresponding task was created in compat vdso
+ * mode.
+ */
if (mm && mm->context.vdso == (void *)VDSO_HIGH_BASE)
return &gate_vma;
return NULL;
@@ -429,7 +430,7 @@ struct vm_area_struct *get_gate_vma(struct task_struct *tsk)

int in_gate_area(struct task_struct *task, unsigned long addr)
{
- const struct vm_area_struct *vma = get_gate_vma(task);
+ const struct vm_area_struct *vma = get_gate_vma(task->mm);

return vma && addr >= vma->vm_start && addr < vma->vm_end;
}
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index d5b640b..bbabdcc 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1906,7 +1906,7 @@ static int elf_core_dump(struct coredump_params *cprm)
segs = current->mm->map_count;
segs += elf_core_extra_phdrs();

- gate_vma = get_gate_vma(current);
+ gate_vma = get_gate_vma(current->mm);
if (gate_vma != NULL)
segs++;

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 60b9148..bb548d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -126,7 +126,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return NULL;
down_read(&mm->mmap_sem);

- tail_vma = get_gate_vma(priv->task);
+ tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;

/* Start with last addr hint */
@@ -277,7 +277,8 @@ static int show_map(struct seq_file *m, void *v)
show_map_vma(m, vma);

if (m->count < m->size) /* vma is copied successfully */
- m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
+ m->version = (vma != get_gate_vma(task->mm))
+ ? vma->vm_start : 0;
return 0;
}

@@ -436,7 +437,8 @@ static int show_smap(struct seq_file *m, void *v)
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

if (m->count < m->size) /* vma is copied successfully */
- m->version = (vma != get_gate_vma(task)) ? vma->vm_start : 0;
+ m->version = (vma != get_gate_vma(task->mm))
+ ? vma->vm_start : 0;
return 0;
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6385fc..b571921 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1568,7 +1568,7 @@ static inline bool kernel_page_present(struct page *page) { return true; }
#endif /* CONFIG_HIBERNATION */
#endif

-extern struct vm_area_struct *get_gate_vma(struct task_struct *tsk);
+extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
int in_gate_area_no_task(unsigned long addr);
int in_gate_area(struct task_struct *task, unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index 5823698..aec7cbd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1439,7 +1439,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(tsk, start)) {
unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk);
+ struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3439,7 +3439,7 @@ static int __init gate_vma_init(void)
__initcall(gate_vma_init);
#endif

-struct vm_area_struct *get_gate_vma(struct task_struct *tsk)
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
#ifdef AT_SYSINFO_EHDR
return &gate_vma;
diff --git a/mm/mlock.c b/mm/mlock.c
index c3924c7f..2689a08c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -237,7 +237,7 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,

if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current))) {
+ vma == get_gate_vma(current->mm))) {

__mlock_vma_pages_range(vma, start, end, NULL);

@@ -332,7 +332,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
int lock = newflags & VM_LOCKED;

if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
- is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
goto out; /* don't set VM_LOCKED, don't count */

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
--
1.7.3.5

2011-03-13 19:52:02

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 04/12] mm: arch: make in_gate_area take an mm_struct instead of a task_struct

Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task. Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/powerpc/kernel/vdso.c | 2 +-
arch/s390/kernel/vdso.c | 2 +-
arch/sh/kernel/vsyscall/vsyscall.c | 2 +-
arch/x86/mm/init_64.c | 4 ++--
arch/x86/vdso/vdso32-setup.c | 4 ++--
include/linux/mm.h | 4 ++--
mm/memory.c | 2 +-
7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 6169f17..467aa9e 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -825,7 +825,7 @@ int in_gate_area_no_task(unsigned long addr)
return 0;
}

-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
return 0;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index d19f305..9006e96 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -342,7 +342,7 @@ int in_gate_area_no_task(unsigned long addr)
return 0;
}

-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
return 0;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 3f9b6f4..62c36a8 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -99,7 +99,7 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return NULL;
}

-int in_gate_area(struct task_struct *task, unsigned long address)
+int in_gate_area(struct mm_struct *mm, unsigned long address)
{
return 0;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 711b690..69fd853 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -879,9 +879,9 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return &gate_vma;
}

-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
- struct vm_area_struct *vma = get_gate_vma(task->mm);
+ struct vm_area_struct *vma = get_gate_vma(mm);

if (!vma)
return 0;
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 1f651f6..f849bb2 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -428,9 +428,9 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
return NULL;
}

-int in_gate_area(struct task_struct *task, unsigned long addr)
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
{
- const struct vm_area_struct *vma = get_gate_vma(task->mm);
+ const struct vm_area_struct *vma = get_gate_vma(mm);

return vma && addr >= vma->vm_start && addr < vma->vm_end;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b571921..c700bbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1571,10 +1571,10 @@ static inline bool kernel_page_present(struct page *page) { return true; }
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
int in_gate_area_no_task(unsigned long addr);
-int in_gate_area(struct task_struct *task, unsigned long addr);
+int in_gate_area(struct mm_struct *mm, unsigned long addr);
#else
int in_gate_area_no_task(unsigned long addr);
-#define in_gate_area(task, addr) ({(void)task; in_gate_area_no_task(addr);})
+#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_task(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */

int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/memory.c b/mm/memory.c
index aec7cbd..d1347ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,7 +1437,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;

vma = find_extend_vma(mm, start);
- if (!vma && in_gate_area(tsk, start)) {
+ if (!vma && in_gate_area(tsk->mm, start)) {
unsigned long pg = start & PAGE_MASK;
struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
pgd_t *pgd;
--
1.7.3.5

2011-03-13 19:52:15

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 05/12] mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm

Now that gate vma's are referenced with respect to a particular mm and not a
particular task it only makes sense to propagate the change to this predicate as
well.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/powerpc/kernel/vdso.c | 2 +-
arch/s390/kernel/vdso.c | 2 +-
arch/sh/kernel/vsyscall/vsyscall.c | 2 +-
arch/x86/mm/init_64.c | 8 ++++----
arch/x86/vdso/vdso32-setup.c | 2 +-
include/linux/mm.h | 6 +++---
kernel/kallsyms.c | 4 ++--
mm/memory.c | 2 +-
mm/nommu.c | 2 +-
9 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 467aa9e..142ab10 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -820,7 +820,7 @@ static int __init vdso_init(void)
}
arch_initcall(vdso_init);

-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 9006e96..d73630b 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -337,7 +337,7 @@ static int __init vdso_init(void)
}
arch_initcall(vdso_init);

-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 62c36a8..1d6d51a 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -104,7 +104,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long address)
return 0;
}

-int in_gate_area_no_task(unsigned long address)
+int in_gate_area_no_mm(unsigned long address)
{
return 0;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 69fd853..8a64fdf 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -890,11 +890,11 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr)
}

/*
- * Use this when you have no reliable task/vma, typically from interrupt
- * context. It is less reliable than using the task's vma and may give
- * false positives:
+ * Use this when you have no reliable mm, typically from interrupt
+ * context. It is less reliable than using a task's mm and may give
+ * false positives.
*/
-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return (addr >= VSYSCALL_START) && (addr < VSYSCALL_END);
}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index f849bb2..468d591 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -435,7 +435,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr)
return vma && addr >= vma->vm_start && addr < vma->vm_end;
}

-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c700bbb..694512d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1570,11 +1570,11 @@ static inline bool kernel_page_present(struct page *page) { return true; }

extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
#ifdef __HAVE_ARCH_GATE_AREA
-int in_gate_area_no_task(unsigned long addr);
+int in_gate_area_no_mm(unsigned long addr);
int in_gate_area(struct mm_struct *mm, unsigned long addr);
#else
-int in_gate_area_no_task(unsigned long addr);
-#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_task(addr);})
+int in_gate_area_no_mm(unsigned long addr);
+#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_mm(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */

int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 6f6d091..b9d0fd1 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -64,14 +64,14 @@ static inline int is_kernel_text(unsigned long addr)
if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) ||
arch_is_kernel_text(addr))
return 1;
- return in_gate_area_no_task(addr);
+ return in_gate_area_no_mm(addr);
}

static inline int is_kernel(unsigned long addr)
{
if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end)
return 1;
- return in_gate_area_no_task(addr);
+ return in_gate_area_no_mm(addr);
}

static int is_ksym_addr(unsigned long addr)
diff --git a/mm/memory.c b/mm/memory.c
index d1347ac..3863e86 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3448,7 +3448,7 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
#endif
}

-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
#ifdef AT_SYSINFO_EHDR
if ((addr >= FIXADDR_USER_START) && (addr < FIXADDR_USER_END))
diff --git a/mm/nommu.c b/mm/nommu.c
index f59e142..e629143 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1963,7 +1963,7 @@ error:
return -ENOMEM;
}

-int in_gate_area_no_task(unsigned long addr)
+int in_gate_area_no_mm(unsigned long addr)
{
return 0;
}
--
1.7.3.5

2011-03-13 19:52:56

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 06/12] mm: use mm_struct to resolve gate vma's in __get_user_pages

We now check if a requested user page overlaps a gate vma using the supplied mm
instead of the supplied task. The given task is now used solely for accounting
purposes and may be NULL.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 18 +++++++++++-------
1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3863e86..36445e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,9 +1437,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;

vma = find_extend_vma(mm, start);
- if (!vma && in_gate_area(tsk->mm, start)) {
+ if (!vma && in_gate_area(mm, start)) {
unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
+ struct vm_area_struct *gate_vma = get_gate_vma(mm);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -1533,10 +1533,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
return i ? i : -EFAULT;
BUG();
}
- if (ret & VM_FAULT_MAJOR)
- tsk->maj_flt++;
- else
- tsk->min_flt++;
+
+ if (tsk) {
+ if (ret & VM_FAULT_MAJOR)
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }

if (ret & VM_FAULT_RETRY) {
*nonblocking = 0;
@@ -1581,7 +1584,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,

/**
* get_user_pages() - pin user pages in memory
- * @tsk: task_struct of target task
+ * @tsk: the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
--
1.7.3.5

2011-03-13 19:55:32

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 09/12] proc: disable mem_write after exec

This change makes mem_write() observe the same constraints as mem_read(). This
is particularly important for mem_write as an accidental leak of the fd across
an exec could result in arbitrary modification of the target process' memory.
IOW, /proc/pid/mem is implicitly close-on-exec.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9d096e8..e52702d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -848,6 +848,10 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
if (check_mem_permission(task))
goto out;

+ copied = -EIO;
+ if (file->private_data != (void *)((long)current->self_exec_id))
+ goto out;
+
copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
--
1.7.3.5

2011-03-13 19:56:45

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 11/12] proc: make check_mem_permission() return an mm_struct on success

This change allows us to take advantage of access_remote_vm(), which in turn
eliminates a security issue with the mem_write() implementation.

The previous implementation of mem_write() was insecure since the target task
could exec a setuid-root binary between the permission check and the actual
write. Holding a reference to the target mm_struct eliminates this
vulnerability.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 58 ++++++++++++++++++++++++++++++++-----------------------
1 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f6b644f..2af83bd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -191,14 +191,20 @@ static int proc_root_link(struct inode *inode, struct path *path)
return result;
}

-static int __check_mem_permission(struct task_struct *task)
+static struct mm_struct *__check_mem_permission(struct task_struct *task)
{
+ struct mm_struct *mm;
+
+ mm = get_task_mm(task);
+ if (!mm)
+ return ERR_PTR(-EINVAL);
+
/*
* A task can always look at itself, in case it chooses
* to use system calls instead of load instructions.
*/
if (task == current)
- return 0;
+ return mm;

/*
* If current is actively ptrace'ing, and would also be
@@ -210,20 +216,23 @@ static int __check_mem_permission(struct task_struct *task)
match = (tracehook_tracer_task(task) == current);
rcu_read_unlock();
if (match && ptrace_may_access(task, PTRACE_MODE_ATTACH))
- return 0;
+ return mm;
}

/*
* Noone else is allowed.
*/
- return -EPERM;
+ mmput(mm);
+ return ERR_PTR(-EPERM);
}

/*
- * Return zero if current may access user memory in @task, -error if not.
+ * If current may access user memory in @task return a reference to the
+ * corresponding mm, otherwise ERR_PTR.
*/
-static int check_mem_permission(struct task_struct *task)
+static struct mm_struct *check_mem_permission(struct task_struct *task)
{
+ struct mm_struct *mm;
int err;

/*
@@ -232,12 +241,12 @@ static int check_mem_permission(struct task_struct *task)
*/
err = mutex_lock_killable(&task->signal->cred_guard_mutex);
if (err)
- return err;
+ return ERR_PTR(err);

- err = __check_mem_permission(task);
+ mm = __check_mem_permission(task);
mutex_unlock(&task->signal->cred_guard_mutex);

- return err;
+ return mm;
}

struct mm_struct *mm_for_maps(struct task_struct *task)
@@ -793,18 +802,14 @@ static ssize_t mem_read(struct file * file, char __user * buf,
if (!task)
goto out_no_task;

- if (check_mem_permission(task))
- goto out;
-
ret = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
goto out;

- ret = 0;
-
- mm = get_task_mm(task);
- if (!mm)
+ mm = check_mem_permission(task);
+ ret = PTR_ERR(mm);
+ if (IS_ERR(mm))
goto out_free;

ret = -EIO;
@@ -818,8 +823,8 @@ static ssize_t mem_read(struct file * file, char __user * buf,
int this_len, retval;

this_len = (count > PAGE_SIZE) ? PAGE_SIZE : count;
- retval = access_process_vm(task, src, page, this_len, 0);
- if (!retval || check_mem_permission(task)) {
+ retval = access_remote_vm(mm, src, page, this_len, 0);
+ if (!retval) {
if (!ret)
ret = -EIO;
break;
@@ -858,22 +863,25 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
char *page;
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
unsigned long dst = *ppos;
+ struct mm_struct *mm;

copied = -ESRCH;
if (!task)
goto out_no_task;

- if (check_mem_permission(task))
- goto out;
+ mm = check_mem_permission(task);
+ copied = PTR_ERR(mm);
+ if (IS_ERR(mm))
+ goto out_task;

copied = -EIO;
if (file->private_data != (void *)((long)current->self_exec_id))
- goto out;
+ goto out_mm;

copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
- goto out;
+ goto out_mm;

copied = 0;
while (count > 0) {
@@ -884,7 +892,7 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
copied = -EFAULT;
break;
}
- retval = access_process_vm(task, dst, page, this_len, 1);
+ retval = access_remote_vm(mm, dst, page, this_len, 1);
if (!retval) {
if (!copied)
copied = -EIO;
@@ -897,7 +905,9 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
}
*ppos = dst;
free_page((unsigned long) page);
-out:
+out_mm:
+ mmput(mm);
+out_task:
put_task_struct(task);
out_no_task:
return copied;
--
1.7.3.5

2011-03-13 19:56:56

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 10/12] proc: hold cred_guard_mutex in check_mem_permission()

Avoid a potential race when task exec's and we get a new ->mm but check against
the old credentials in ptrace_may_access().

Holding of the mutex is implemented by factoring out the body of the code into a
helper function __check_mem_permission(). Performing this factorization now
simplifies upcoming changes and minimizes churn in the diff's.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 26 ++++++++++++++++++++++----
1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e52702d..f6b644f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -191,10 +191,7 @@ static int proc_root_link(struct inode *inode, struct path *path)
return result;
}

-/*
- * Return zero if current may access user memory in @task, -error if not.
- */
-static int check_mem_permission(struct task_struct *task)
+static int __check_mem_permission(struct task_struct *task)
{
/*
* A task can always look at itself, in case it chooses
@@ -222,6 +219,27 @@ static int check_mem_permission(struct task_struct *task)
return -EPERM;
}

+/*
+ * Return zero if current may access user memory in @task, -error if not.
+ */
+static int check_mem_permission(struct task_struct *task)
+{
+ int err;
+
+ /*
+ * Avoid racing if task exec's as we might get a new mm but validate
+ * against old credentials.
+ */
+ err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ if (err)
+ return err;
+
+ err = __check_mem_permission(task);
+ mutex_unlock(&task->signal->cred_guard_mutex);
+
+ return err;
+}
+
struct mm_struct *mm_for_maps(struct task_struct *task)
{
struct mm_struct *mm;
--
1.7.3.5

2011-03-13 19:57:21

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 12/12] proc: enable writing to /proc/pid/mem

With recent changes there is no longer a security hazard with writing to
/proc/pid/mem. Remove the #ifdef.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2af83bd..964bc5d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -852,10 +852,6 @@ out_no_task:
return ret;
}

-#define mem_write NULL
-
-#ifndef mem_write
-/* This is a security hazard */
static ssize_t mem_write(struct file * file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -912,7 +908,6 @@ out_task:
out_no_task:
return copied;
}
-#endif

loff_t mem_lseek(struct file *file, loff_t offset, int orig)
{
--
1.7.3.5

2011-03-13 20:23:55

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 07/12] mm: factor out main logic of access_process_vm

Introduce an internal helper __access_remote_vm and base access_process_vm on
top of it. This new method may be called with a NULL task_struct if page fault
accounting is not desired. This code will be shared with a new address space
accessor that is independent of task_struct.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 35 +++++++++++++++++++++++++----------
1 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 36445e3..68eec4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3593,20 +3593,15 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
#endif

/*
- * Access another process' address space.
- * Source/target buffer must be kernel space,
- * Do not walk the page table directly, use get_user_pages
+ * Access another process' address space as given in mm. If non-NULL, use the
+ * given task for page fault accounting.
*/
-int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
+static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, int write)
{
- struct mm_struct *mm;
struct vm_area_struct *vma;
void *old_buf = buf;

- mm = get_task_mm(tsk);
- if (!mm)
- return 0;
-
down_read(&mm->mmap_sem);
/* ignore errors, just check how much was successfully transferred */
while (len) {
@@ -3655,12 +3650,32 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
addr += bytes;
}
up_read(&mm->mmap_sem);
- mmput(mm);

return buf - old_buf;
}

/*
+ * Access another process' address space.
+ * Source/target buffer must be kernel space,
+ * Do not walk the page table directly, use get_user_pages
+ */
+int access_process_vm(struct task_struct *tsk, unsigned long addr,
+ void *buf, int len, int write)
+{
+ struct mm_struct *mm;
+ int ret;
+
+ mm = get_task_mm(tsk);
+ if (!mm)
+ return 0;
+
+ ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
+ mmput(mm);
+
+ return ret;
+}
+
+/*
* Print the name of a VMA.
*/
void print_vma_addr(char *prefix, unsigned long ip)
--
1.7.3.5

2011-03-13 20:25:22

by Stephen Wilson

[permalink] [raw]
Subject: [PATCH 08/12] mm: implement access_remote_vm

Provide an alternative to access_process_vm that allows the caller to obtain a
reference to the supplied mm_struct.

Signed-off-by: Stephen Wilson <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 16 ++++++++++++++++
2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 694512d..e5fde8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -964,6 +964,8 @@ static inline int handle_mm_fault(struct mm_struct *mm,

extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
+extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write);

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
diff --git a/mm/memory.c b/mm/memory.c
index 68eec4f..c26e4f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3654,6 +3654,22 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
return buf - old_buf;
}

+/**
+ * @access_remote_vm - access another process' address space
+ * @mm: the mm_struct of the target address space
+ * @addr: start address to access
+ * @buf: source or destination buffer
+ * @len: number of bytes to transfer
+ * @write: whether the access is a write
+ *
+ * The caller must hold a reference on @mm.
+ */
+int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write)
+{
+ return __access_remote_vm(NULL, mm, addr, buf, len, write);
+}
+
/*
* Access another process' address space.
* Source/target buffer must be kernel space,
--
1.7.3.5

2011-03-14 00:10:05

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 11/12] proc: make check_mem_permission() return an mm_struct on success

Hi Stephen,

On Sun, Mar 13, 2011 at 03:49:23PM -0400, Stephen Wilson wrote:
> This change allows us to take advantage of access_remote_vm(), which in turn
> eliminates a security issue with the mem_write() implementation.
>
> The previous implementation of mem_write() was insecure since the target task
> could exec a setuid-root binary between the permission check and the actual
> write. Holding a reference to the target mm_struct eliminates this
> vulnerability.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> fs/proc/base.c | 58 ++++++++++++++++++++++++++++++++-----------------------
> 1 files changed, 34 insertions(+), 24 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f6b644f..2af83bd 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -858,22 +863,25 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
> char *page;
> struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> unsigned long dst = *ppos;
> + struct mm_struct *mm;
>
> copied = -ESRCH;
> if (!task)
> goto out_no_task;
>
> - if (check_mem_permission(task))
> - goto out;
> + mm = check_mem_permission(task);
> + copied = PTR_ERR(mm);
> + if (IS_ERR(mm))
> + goto out_task;
>
> copied = -EIO;
> if (file->private_data != (void *)((long)current->self_exec_id))
> - goto out;
> + goto out_mm;

The file->private_data test seems wrong to me. Is there a case were the mm
returned from check_mem_permission(task) can refer to something that is no
longer attached to task?

For example:
- pid 100 ptraces pid 200
- pid 100 opens /proc/200/mem
- pid 200 execs into something else

A read of that mem fd could, IIUC, read from the new pid 200 mm, but
only after passing check_mem_permission(task) again. This is stopped
by the private_data test. But should it, since check_mem_permission()
passed?

Even if it does mean to block it, it's insufficient since pid 200
could just exec u32 many times and align with the original private_data
value. What is that test trying to do? And I'm curious for both mem_write
as well as the existing mem_read use of the test, since I'd like to see
a general solution to the "invalidate /proc fds across exec" so we can
close CVE-2011-1020 for everything[1].

Associated with this, the drop of check_mem_permission(task) during the
mem_read loop implies that the mm is locked during that loop and seems to
reflect what you're saying ("Holding a reference to the target mm_struct
eliminates this vulnerability."), meaning there's no reason to recheck
permissions. Is that accurate?

Thanks,

-Kees

[1] https://lkml.org/lkml/2011/2/7/368

--
Kees Cook
Ubuntu Security Team

2011-03-14 01:01:12

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH 11/12] proc: make check_mem_permission() return an mm_struct on success


On Sun, Mar 13, 2011 at 05:08:59PM -0700, Kees Cook wrote:
> On Sun, Mar 13, 2011 at 03:49:23PM -0400, Stephen Wilson wrote:
> > copied = -EIO;
> > if (file->private_data != (void *)((long)current->self_exec_id))
> > - goto out;
> > + goto out_mm;
>
> The file->private_data test seems wrong to me. Is there a case were the mm
> returned from check_mem_permission(task) can refer to something that is no
> longer attached to task?
>
> For example:
> - pid 100 ptraces pid 200
> - pid 100 opens /proc/200/mem
> - pid 200 execs into something else

If the _target_ task (pid 200) execs then we are OK -- we hold a
reference to the *old* mm and it is that to which we read and write via
access_remote_vm().

In the case of the file->private_data test we are looking at the
*ptracer* (pid 100). Here we are guarding against the case where the
tracer exec's and accidentally leaks the fd (hence the test wrt
current). IOW, /proc/pid/mem is implicitly close on exec. This is just
a minor feature to protect against buggy user space reading/writing
mistakenly into the targets address space.

> only after passing check_mem_permission(task) again. This is stopped
> by the private_data test. But should it, since check_mem_permission()
> passed?

No. I hope the above clears that up.

> Even if it does mean to block it, it's insufficient since pid 200
> could just exec u32 many times and align with the original private_data
> value.

Just for clarity, in your example it would be pid 100 that would need to
exec many times. And yes, I think it would be possible for pid 100 to
exec() N times before the next call to mem_read/mem_write and thus
subvert this check.

Perhaps we can improve things (I would need to look into how O_CLOEXEC
is implemented), however please note that the primary rationale here is
to protect against bugs: the tracer already has the needed privilege,
and it would be silly for it to exec N times just to pass the fd out
across an exec().


> What is that test trying to do? And I'm curious for both mem_write
> as well as the existing mem_read use of the test, since I'd like to see
> a general solution to the "invalidate /proc fds across exec" so we can
> close CVE-2011-1020 for everything[1].

These patches certainly do not add to the problem -- but they do not try
to address the general issue either.

> Associated with this, the drop of check_mem_permission(task) during the
> mem_read loop implies that the mm is locked during that loop and seems to
> reflect what you're saying ("Holding a reference to the target mm_struct
> eliminates this vulnerability."), meaning there's no reason to recheck
> permissions. Is that accurate?

Yes, precisely. Once we have a reference to the mm we do not need to
worry about things changing underneath our feet, so the second check in
mem_read() is redundant and can be dropped.


Take care,

>
> Thanks,
>
> -Kees
>
> [1] https://lkml.org/lkml/2011/2/7/368
>
> --
> Kees Cook
> Ubuntu Security Team


--
steve

2011-03-14 15:14:32

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 11/12] proc: make check_mem_permission() return an mm_struct on success

On Sun, Mar 13, 2011 at 08:59:48PM -0400, Stephen Wilson wrote:
> On Sun, Mar 13, 2011 at 05:08:59PM -0700, Kees Cook wrote:
> > On Sun, Mar 13, 2011 at 03:49:23PM -0400, Stephen Wilson wrote:
> > > copied = -EIO;
> > > if (file->private_data != (void *)((long)current->self_exec_id))
> > > - goto out;
> > > + goto out_mm;
> >
> > The file->private_data test seems wrong to me. Is there a case were the mm
> > returned from check_mem_permission(task) can refer to something that is no
> > longer attached to task?
> >
> > For example:
> > - pid 100 ptraces pid 200
> > - pid 100 opens /proc/200/mem
> > - pid 200 execs into something else
>
> If the _target_ task (pid 200) execs then we are OK -- we hold a
> reference to the *old* mm and it is that to which we read and write via
> access_remote_vm().

Right, the old mm is held during read_mem(). But isn't the mm fetched
from check_mem_permission(task) each time pid 100 reads from the
/proc/200/mem fd? (And if so, that's still okay, it still passes through
check_mem_permission() so the access will be validated.)

> In the case of the file->private_data test we are looking at the
> *ptracer* (pid 100). Here we are guarding against the case where the
> tracer exec's and accidentally leaks the fd (hence the test wrt
> current). IOW, /proc/pid/mem is implicitly close on exec. This is just
> a minor feature to protect against buggy user space reading/writing
> mistakenly into the targets address space.

Ah! Right, thanks, that clears that up.

> > What is that test trying to do? And I'm curious for both mem_write
> > as well as the existing mem_read use of the test, since I'd like to see
> > a general solution to the "invalidate /proc fds across exec" so we can
> > close CVE-2011-1020 for everything[1].
>
> These patches certainly do not add to the problem -- but they do not try
> to address the general issue either.

The use of check_mem_permission() already protects /proc/pid/mem, but
that test is much stricter than the may_ptrace() checks of things like
/proc/pid/maps. Regardless, yeah, there's no problem here that I can see.

> > Associated with this, the drop of check_mem_permission(task) during the
> > mem_read loop implies that the mm is locked during that loop and seems to
> > reflect what you're saying ("Holding a reference to the target mm_struct
> > eliminates this vulnerability."), meaning there's no reason to recheck
> > permissions. Is that accurate?
>
> Yes, precisely. Once we have a reference to the mm we do not need to
> worry about things changing underneath our feet, so the second check in
> mem_read() is redundant and can be dropped.

Excellent. :)

-Kees

--
Kees Cook
Ubuntu Security Team