For a long time /proc/pid/mem has provided a read-only interface, at least
since 2.4.0. However, a write capability has existed "forever" in tree via the
function mem_write(), disabled with an #ifdef along with the comment "this is a
security hazard". Currently, the main problem with mem_write() is that between
the time permissions are checked and the actual write the target task could
exec a setuid-root binary.
This patch series enables safe writes to /proc/pid/mem. The principle strategy
is to get a reference to the target task's mm before the permission check, and
to hold that reference until after the write completes.
This patch is useful as it gives debuggers a simple and efficient mechanism to
manipulate a processes address space. Memory can be read and written using
single calls to pread(2) and pwrite(2) instead of iteratively calling
into ptrace(2). In addition, /proc/pid/mem has always had write permissions
enabled, so clearly it *wants* to be written to.
This series builds off previous work up for review here:
http://lkml.org/lkml/2011/3/8/409
The general approach used was suggested to me by Alexander Viro, but any
mistakes present in these patches are entirely my own.
--
steve
Stephen Wilson (6):
mm: use mm_struct to resolve gate vma's in __get_user_pages
mm: factor out main logic of access_process_vm
mm: implement access_remote_vm
proc: disable mem_write after exec
proc: make check_mem_permission() return an mm_struct on success
proc: enable writing to /proc/pid/mem
fs/proc/base.c | 61 ++++++++++++++++++++++++++-------------------
include/linux/mm.h | 2 +
mm/memory.c | 69 +++++++++++++++++++++++++++++++++++++++-------------
3 files changed, 89 insertions(+), 43 deletions(-)
We now check if a requested user page overlaps a gate vma using the supplied mm
instead of the supplied task. The given task is now used solely for accounting
purposes and may be NULL.
Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 18 +++++++++++-------
1 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3863e86..36445e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,9 +1437,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;
vma = find_extend_vma(mm, start);
- if (!vma && in_gate_area(tsk->mm, start)) {
+ if (!vma && in_gate_area(mm, start)) {
unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
+ struct vm_area_struct *gate_vma = get_gate_vma(mm);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -1533,10 +1533,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
return i ? i : -EFAULT;
BUG();
}
- if (ret & VM_FAULT_MAJOR)
- tsk->maj_flt++;
- else
- tsk->min_flt++;
+
+ if (tsk) {
+ if (ret & VM_FAULT_MAJOR)
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }
if (ret & VM_FAULT_RETRY) {
*nonblocking = 0;
@@ -1581,7 +1584,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
/**
* get_user_pages() - pin user pages in memory
- * @tsk: task_struct of target task
+ * @tsk: the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
--
1.7.3.5
Introduce an internal helper __access_remote_vm and base access_process_vm on
top of it. This new method may be called with a NULL task_struct if page fault
accounting is not desired. This code will be shared with a new address space
accessor that is independent of task_struct.
Signed-off-by: Stephen Wilson <[email protected]>
---
mm/memory.c | 35 +++++++++++++++++++++++++----------
1 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 36445e3..68eec4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3593,20 +3593,15 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
#endif
/*
- * Access another process' address space.
- * Source/target buffer must be kernel space,
- * Do not walk the page table directly, use get_user_pages
+ * Access another process' address space as given in mm. If non-NULL, use the
+ * given task for page fault accounting.
*/
-int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
+static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, int write)
{
- struct mm_struct *mm;
struct vm_area_struct *vma;
void *old_buf = buf;
- mm = get_task_mm(tsk);
- if (!mm)
- return 0;
-
down_read(&mm->mmap_sem);
/* ignore errors, just check how much was successfully transferred */
while (len) {
@@ -3655,12 +3650,32 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
addr += bytes;
}
up_read(&mm->mmap_sem);
- mmput(mm);
return buf - old_buf;
}
/*
+ * Access another process' address space.
+ * Source/target buffer must be kernel space,
+ * Do not walk the page table directly, use get_user_pages
+ */
+int access_process_vm(struct task_struct *tsk, unsigned long addr,
+ void *buf, int len, int write)
+{
+ struct mm_struct *mm;
+ int ret;
+
+ mm = get_task_mm(tsk);
+ if (!mm)
+ return 0;
+
+ ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
+ mmput(mm);
+
+ return ret;
+}
+
+/*
* Print the name of a VMA.
*/
void print_vma_addr(char *prefix, unsigned long ip)
--
1.7.3.5
This change allows us to take advantage of access_remote_vm(), which in turn
enables a secure mem_write() implementation.
The previous implementation of mem_write() was insecure since the target task
could exec a setuid-root binary between the permission check and the actual
write. Holding a reference to the target mm_struct eliminates this
vulnerability.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 54 ++++++++++++++++++++++++++++++++----------------------
1 files changed, 32 insertions(+), 22 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index e52702d..5ffc927 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -192,16 +192,23 @@ static int proc_root_link(struct inode *inode, struct path *path)
}
/*
- * Return zero if current may access user memory in @task, -error if not.
+ * If current may access user memory in @task return a reference to the
+ * corresponding mm, otherwise NULL.
*/
-static int check_mem_permission(struct task_struct *task)
+static struct mm_struct *check_mem_permission(struct task_struct *task)
{
+ struct mm_struct *mm;
+
+ mm = get_task_mm(task);
+ if (!mm)
+ return NULL;
+
/*
* A task can always look at itself, in case it chooses
* to use system calls instead of load instructions.
*/
if (task == current)
- return 0;
+ return mm;
/*
* If current is actively ptrace'ing, and would also be
@@ -213,13 +220,14 @@ static int check_mem_permission(struct task_struct *task)
match = (tracehook_tracer_task(task) == current);
rcu_read_unlock();
if (match && ptrace_may_access(task, PTRACE_MODE_ATTACH))
- return 0;
+ return mm;
}
/*
* Noone else is allowed.
*/
- return -EPERM;
+ mmput(mm);
+ return NULL;
}
struct mm_struct *mm_for_maps(struct task_struct *task)
@@ -775,24 +783,20 @@ static ssize_t mem_read(struct file * file, char __user * buf,
if (!task)
goto out_no_task;
- if (check_mem_permission(task))
+ ret = -EPERM;
+ mm = check_mem_permission(task);
+ if (!mm)
goto out;
ret = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
- goto out;
-
- ret = 0;
-
- mm = get_task_mm(task);
- if (!mm)
- goto out_free;
+ goto out_put;
ret = -EIO;
if (file->private_data != (void*)((long)current->self_exec_id))
- goto out_put;
+ goto out_free;
ret = 0;
@@ -800,8 +804,8 @@ static ssize_t mem_read(struct file * file, char __user * buf,
int this_len, retval;
this_len = (count > PAGE_SIZE) ? PAGE_SIZE : count;
- retval = access_process_vm(task, src, page, this_len, 0);
- if (!retval || check_mem_permission(task)) {
+ retval = access_remote_vm(mm, src, page, this_len, 0);
+ if (!retval) {
if (!ret)
ret = -EIO;
break;
@@ -819,10 +823,10 @@ static ssize_t mem_read(struct file * file, char __user * buf,
}
*ppos = src;
-out_put:
- mmput(mm);
out_free:
free_page((unsigned long) page);
+out_put:
+ mmput(mm);
out:
put_task_struct(task);
out_no_task:
@@ -838,6 +842,7 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
{
int copied;
char *page;
+ struct mm_struct *mm;
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
unsigned long dst = *ppos;
@@ -845,17 +850,19 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
if (!task)
goto out_no_task;
- if (check_mem_permission(task))
+ copied = -EPERM;
+ mm = check_mem_permission(task);
+ if (!mm)
goto out;
copied = -EIO;
if (file->private_data != (void *)((long)current->self_exec_id))
- goto out;
+ goto out_put;
copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
- goto out;
+ goto out_put;
copied = 0;
while (count > 0) {
@@ -866,7 +873,7 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
copied = -EFAULT;
break;
}
- retval = access_process_vm(task, dst, page, this_len, 1);
+ retval = access_remote_vm(mm, dst, page, this_len, 1);
if (!retval) {
if (!copied)
copied = -EIO;
@@ -879,6 +886,9 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
}
*ppos = dst;
free_page((unsigned long) page);
+
+out_put:
+ mmput(mm);
out:
put_task_struct(task);
out_no_task:
--
1.7.3.5
This change makes mem_write() observe the same constraints as mem_read(). This
is particularly important for mem_write as an accidental leak of the fd across
an exec could result in arbitrary modification of the target process' memory.
IOW, /proc/pid/mem is implicitly close-on-exec.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9d096e8..e52702d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -848,6 +848,10 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
if (check_mem_permission(task))
goto out;
+ copied = -EIO;
+ if (file->private_data != (void *)((long)current->self_exec_id))
+ goto out;
+
copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
--
1.7.3.5
Provide an alternative to access_process_vm that allows the caller to obtain a
reference to the supplied mm_struct.
Signed-off-by: Stephen Wilson <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 16 ++++++++++++++++
2 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 694512d..e5fde8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -964,6 +964,8 @@ static inline int handle_mm_fault(struct mm_struct *mm,
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
+extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write);
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
diff --git a/mm/memory.c b/mm/memory.c
index 68eec4f..c26e4f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3654,6 +3654,22 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
return buf - old_buf;
}
+/**
+ * @access_remote_vm - access another process' address space
+ * @mm: the mm_struct of the target address space
+ * @addr: start address to access
+ * @buf: source or destination buffer
+ * @len: number of bytes to transfer
+ * @write: whether the access is a write
+ *
+ * The caller must hold a reference on @mm.
+ */
+int access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, int write)
+{
+ return __access_remote_vm(NULL, mm, addr, buf, len, write);
+}
+
/*
* Access another process' address space.
* Source/target buffer must be kernel space,
--
1.7.3.5
With recent changes there is no longer a security hazard with writing to
/proc/pid/mem. Remove the #ifdef.
Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5ffc927..41d9c46 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -833,10 +833,6 @@ out_no_task:
return ret;
}
-#define mem_write NULL
-
-#ifndef mem_write
-/* This is a security hazard */
static ssize_t mem_write(struct file * file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -894,7 +890,6 @@ out:
out_no_task:
return copied;
}
-#endif
loff_t mem_lseek(struct file *file, loff_t offset, int orig)
{
--
1.7.3.5
On Tue, Mar 08, 2011 at 07:42:17PM -0500, Stephen Wilson wrote:
> For a long time /proc/pid/mem has provided a read-only interface, at least
> since 2.4.0. However, a write capability has existed "forever" in tree via the
> function mem_write(), disabled with an #ifdef along with the comment "this is a
> security hazard". Currently, the main problem with mem_write() is that between
> the time permissions are checked and the actual write the target task could
> exec a setuid-root binary.
>
> This patch series enables safe writes to /proc/pid/mem. The principle strategy
> is to get a reference to the target task's mm before the permission check, and
> to hold that reference until after the write completes.
One note: I'd rather prefer approach similar to mm_for_maps(). IOW, instead
of "check, then get mm, then check _again_ to decide if we are allowed to
use it", just turn check_mm_permissions() into a function that returns
you a safe mm or gives you NULL (or, better yet, ERR_PTR(...)). With all
checks done within that sucker.
Then mem_read() and mem_write() wouldn't need to recheck anything again
and the same helper would be usable for other things as well. I mean
something like this: (*WARNING* - completely untested)
err = mutex_lock_killable(&tsk->signal->cred_guard_mutex);
if (err)
return ERR_PTR(err);
mm = get_tsk_mm(tsk);
if (!mm) {
mm = ERR_PTR(-EPERM); /* maybe EINVAL here? */
} else if (mm != current->mm) {
int match;
/*
* If current is actively ptrace'ing, and would also be
* permitted to freshly attach with ptrace now, permit it.
*/
if (!tsk_is_stopped_or_traced(tsk))
goto Eperm;
rcu_read_lock();
match = (tracehook_tracer_tsk(tsk) == current);
rcu_read_unlock();
if (!match)
goto Eperm;
if (!ptrace_may_access(tsk, PTRACE_MODE_ATTACH))
goto Eperm;
}
mutex_unlock(&tsk->signal->cred_guard_mutex);
return mm;
Eperm:
mutex_unlock(&tsk->signal->cred_guard_mutex);
mmput(mm);
return ERR_PTR(-EPERM);
On Wed, Mar 09, 2011 at 01:30:17AM +0000, Al Viro wrote:
> On Tue, Mar 08, 2011 at 07:42:17PM -0500, Stephen Wilson wrote:
> > This patch series enables safe writes to /proc/pid/mem. The principle strategy
> > is to get a reference to the target task's mm before the permission check, and
> > to hold that reference until after the write completes.
>
> One note: I'd rather prefer approach similar to mm_for_maps(). IOW, instead
> of "check, then get mm, then check _again_ to decide if we are allowed to
> use it", just turn check_mm_permissions() into a function that returns
> you a safe mm or gives you NULL (or, better yet, ERR_PTR(...)). With all
> checks done within that sucker.
OK. That certainly makes a lot of sense. That can easily be added as
an additional patch to the series so that it is perfectly clear as to
what has been changed and how.
I think we could also remove the intermediate copy in both mem_read() and
mem_write() as well, but I think such optimizations could be left for
follow on patches.
> Then mem_read() and mem_write() wouldn't need to recheck anything again
> and the same helper would be usable for other things as well. I mean
> something like this: (*WARNING* - completely untested)
Will work this into the series, test it, etc.
Thanks!
--
steve
On Tue, Mar 08, 2011 at 09:15:25PM -0500, Stephen Wilson wrote:
> I think we could also remove the intermediate copy in both mem_read() and
> mem_write() as well, but I think such optimizations could be left for
> follow on patches.
How? We do copy_.._user() in there; it can trigger page faults and
that's not something you want while holding mmap_sem on some mm.
Looks like a deadlock country... So we can't do that from inside
access_process_vm() or its analogs, which means buffering in caller.
On Wed, Mar 09, 2011 at 02:33:04AM +0000, Al Viro wrote:
> On Tue, Mar 08, 2011 at 09:15:25PM -0500, Stephen Wilson wrote:
>
> > I think we could also remove the intermediate copy in both mem_read() and
> > mem_write() as well, but I think such optimizations could be left for
> > follow on patches.
>
> How? We do copy_.._user() in there; it can trigger page faults and
> that's not something you want while holding mmap_sem on some mm.
Ah, OK. I did not think thru that subtlety. Was merely mentioning
"things we might do afterwords" as opposed to a genuine proposal.
> Looks like a deadlock country... So we can't do that from inside
> access_process_vm() or its analogs, which means buffering in caller.
Thanks for the feed back -- I am certainly (relatively speaking) new to
the code so your insights are most valuable.
Thanks again!
--
steve
> We now check if a requested user page overlaps a gate vma using the supplied mm
> instead of the supplied task. The given task is now used solely for accounting
> purposes and may be NULL.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> mm/memory.c | 18 +++++++++++-------
> 1 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3863e86..36445e3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1437,9 +1437,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> struct vm_area_struct *vma;
>
> vma = find_extend_vma(mm, start);
> - if (!vma && in_gate_area(tsk->mm, start)) {
> + if (!vma && in_gate_area(mm, start)) {
> unsigned long pg = start & PAGE_MASK;
> - struct vm_area_struct *gate_vma = get_gate_vma(tsk->mm);
> + struct vm_area_struct *gate_vma = get_gate_vma(mm);
> pgd_t *pgd;
> pud_t *pud;
> pmd_t *pmd;
Hmm..
Is this works? In exec() case task has two mm, old and new-borned. tsk has
no enough information to detect gate area if 64bit process exec 32bit process
or oppsite case. On Linux, 32bit and 64bit processes have perfectly different
process vma layout.
Am I missing something?
> This change makes mem_write() observe the same constraints as mem_read(). This
> is particularly important for mem_write as an accidental leak of the fd across
> an exec could result in arbitrary modification of the target process' memory.
> IOW, /proc/pid/mem is implicitly close-on-exec.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> fs/proc/base.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9d096e8..e52702d 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -848,6 +848,10 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
> if (check_mem_permission(task))
> goto out;
>
> + copied = -EIO;
> + if (file->private_data != (void *)((long)current->self_exec_id))
> + goto out;
> +
I agree.
Reviewed-by: KOSAKI Motohiro <[email protected]>
On Wed, Mar 09, 2011 at 02:19:30PM +0900, KOSAKI Motohiro wrote:
> Hmm..
> Is this works? In exec() case task has two mm, old and new-borned. tsk has
> no enough information to detect gate area if 64bit process exec 32bit process
> or oppsite case. On Linux, 32bit and 64bit processes have perfectly different
> process vma layout.
>
> Am I missing something?
Patch series refered to in [0/6] ;-) FWIW, that would probably be better
off as one mail thread - would be easier to find.
What happens is that mm_struct gets marked as 32bit/64bit at execve time
(on x86, everything else doesn't care), so this stuff becomes possible
to calculate by mm_struct alone.
> This change allows us to take advantage of access_remote_vm(), which in turn
> enables a secure mem_write() implementation.
>
> The previous implementation of mem_write() was insecure since the target task
> could exec a setuid-root binary between the permission check and the actual
> write. Holding a reference to the target mm_struct eliminates this
> vulnerability.
>
> Signed-off-by: Stephen Wilson <[email protected]>
OK, I like this idea. So, I suppose you will resend newer version as applied Al's
comment and I'll be able to ack this.
Thanks.
On Wed, Mar 09, 2011 at 06:06:17AM +0000, Al Viro wrote:
> On Wed, Mar 09, 2011 at 02:19:30PM +0900, KOSAKI Motohiro wrote:
>
> > Hmm..
> > Is this works? In exec() case task has two mm, old and new-borned. tsk has
> > no enough information to detect gate area if 64bit process exec 32bit process
> > or oppsite case. On Linux, 32bit and 64bit processes have perfectly different
> > process vma layout.
> >
> > Am I missing something?
>
> Patch series refered to in [0/6] ;-) FWIW, that would probably be better
> off as one mail thread - would be easier to find.
OK. After the first half has gone thru review I will respin (with changes)
as a single series. I was actually hoping the split would make review a
little bit easier, but in retrospect I could have accomplished the same
thing by simply pointing out the two halves in the series "cover
letter".
--
steve