From: Jovi Zhang <[email protected]>
It should be better if put check_mem_permission before __get_free_page
in mem_read, to be same as function mem_write.
Signed-off-by: Jovi Zhang <[email protected]>
---
fs/proc/base.c | 24 ++++++++++--------------
1 files changed, 10 insertions(+), 14 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..d66126d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -831,23 +831,21 @@ static ssize_t mem_read(struct file * file, char __user * buf,
if (!task)
goto out_no_task;
- ret = -ENOMEM;
- page = (char *)__get_free_page(GFP_TEMPORARY);
- if (!page)
- goto out;
-
mm = check_mem_permission(task);
ret = PTR_ERR(mm);
if (IS_ERR(mm))
- goto out_free;
+ goto out_task;
ret = -EIO;
-
if (file->private_data != (void*)((long)current->self_exec_id))
- goto out_put;
+ goto out_mm;
+
+ ret = -ENOMEM;
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page)
+ goto out_mm;
ret = 0;
-
while (count > 0) {
int this_len, retval;
@@ -870,12 +868,10 @@ static ssize_t mem_read(struct file * file, char __user * buf,
count -= retval;
}
*ppos = src;
-
-out_put:
- mmput(mm);
-out_free:
free_page((unsigned long) page);
-out:
+out_mm:
+ mmput(mm);
+out_task:
put_task_struct(task);
out_no_task:
return ret;
--
1.7.2.3
2011/4/18 <[email protected]>:
> From: Jovi Zhang <[email protected]>
>
> It should be better if put check_mem_permission before __get_free_page
> in mem_read, to be same as function mem_write.
>
> Signed-off-by: Jovi Zhang <[email protected]>
Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>
On Sun, Apr 17, 2011 at 08:20:01PM -0400, [email protected] wrote:
> From: Jovi Zhang <[email protected]>
>
> It should be better if put check_mem_permission before __get_free_page
> in mem_read, to be same as function mem_write.
>
> Signed-off-by: Jovi Zhang <[email protected]>
looks good.
Reviewed-by: Stephen Wilson <[email protected]>
--
steve
On Sun, 17 Apr 2011, [email protected] wrote:
> From: Jovi Zhang <[email protected]>
>
> It should be better if put check_mem_permission before __get_free_page
> in mem_read, to be same as function mem_write.
>
> Signed-off-by: Jovi Zhang <[email protected]>
Sorry to be contrary, but I disagree with this. I'm all for consistency,
but is there a particular reason why you think the mem_write ordering is
right and mem_read wrong?
My reason for preferring the current mem_read ordering is this:
check_mem_permission gets a reference to the mm. If we __get_free_page
after check_mem_permission, imagine what happens if the system is out
of memory, and the mm we're looking at is selected for killing by the
OOM killer: while we wait in __get_free_page for more memory, no memory
is freed from the selected mm because it cannot reach exit_mmap while
we hold that reference.
(I may be overstating the case: a little memory may be freed from the
exiting task's stack, and kswapd should still be able to pick some pages
off the mm. But nonetheless, we would do better to let this mm go.)
No doubt there are plenty of other places in /proc which try to
allocate memory after taking a reference on an mm; but I think
we should be working to eliminate those rather than add to them.
Hugh
Hi High,
> On Sun, 17 Apr 2011, [email protected] wrote:
> > From: Jovi Zhang <[email protected]>
> >
> > It should be better if put check_mem_permission before __get_free_page
> > in mem_read, to be same as function mem_write.
> >
> > Signed-off-by: Jovi Zhang <[email protected]>
>
> Sorry to be contrary, but I disagree with this. I'm all for consistency,
> but is there a particular reason why you think the mem_write ordering is
> right and mem_read wrong?
>
> My reason for preferring the current mem_read ordering is this:
>
> check_mem_permission gets a reference to the mm. If we __get_free_page
> after check_mem_permission, imagine what happens if the system is out
> of memory, and the mm we're looking at is selected for killing by the
> OOM killer: while we wait in __get_free_page for more memory, no memory
> is freed from the selected mm because it cannot reach exit_mmap while
> we hold that reference.
Right.
sorry for that. I missed this point.
> (I may be overstating the case: a little memory may be freed from the
> exiting task's stack, and kswapd should still be able to pick some pages
> off the mm. But nonetheless, we would do better to let this mm go.)
>
> No doubt there are plenty of other places in /proc which try to
> allocate memory after taking a reference on an mm; but I think
> we should be working to eliminate those rather than add to them.
then, Should we change mem_write instead?
>From 74f827ce74e1c4f846905e940edfa5f639b5a2ce Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 26 Apr 2011 13:57:02 +0900
Subject: [PATCH] [PATCH] proc: put check_mem_permission after __get_free_page in mem_write
It should be better if put check_mem_permission after __get_free_page
in mem_write, to be same as function mem_read.
Hugh Dickins explained the reason.
check_mem_permission gets a reference to the mm. If we __get_free_page
after check_mem_permission, imagine what happens if the system is out
of memory, and the mm we're looking at is selected for killing by the
OOM killer: while we wait in __get_free_page for more memory, no memory
is freed from the selected mm because it cannot reach exit_mmap while
we hold that reference.
Reported-by: Jovi Zhang <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Stephen Wilson <[email protected]>
---
fs/proc/base.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4deef2e..e93be6e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -894,20 +894,20 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
if (!task)
goto out_no_task;
+ copied = -ENOMEM;
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page)
+ goto out_task;
+
mm = check_mem_permission(task);
copied = PTR_ERR(mm);
if (IS_ERR(mm))
- goto out_task;
+ goto out_free;
copied = -EIO;
if (file->private_data != (void *)((long)current->self_exec_id))
goto out_mm;
- copied = -ENOMEM;
- page = (char *)__get_free_page(GFP_TEMPORARY);
- if (!page)
- goto out_mm;
-
copied = 0;
while (count > 0) {
int this_len, retval;
@@ -929,9 +929,11 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
count -= retval;
}
*ppos = dst;
- free_page((unsigned long) page);
+
out_mm:
mmput(mm);
+out_free:
+ free_page((unsigned long) page);
out_task:
put_task_struct(task);
out_no_task:
--
1.7.3.1
> Hi High,
>
> > On Sun, 17 Apr 2011, [email protected] wrote:
> > > From: Jovi Zhang <[email protected]>
> > >
> > > It should be better if put check_mem_permission before __get_free_page
> > > in mem_read, to be same as function mem_write.
> > >
> > > Signed-off-by: Jovi Zhang <[email protected]>
> >
> > Sorry to be contrary, but I disagree with this. I'm all for consistency,
> > but is there a particular reason why you think the mem_write ordering is
> > right and mem_read wrong?
> >
> > My reason for preferring the current mem_read ordering is this:
> >
> > check_mem_permission gets a reference to the mm. If we __get_free_page
> > after check_mem_permission, imagine what happens if the system is out
> > of memory, and the mm we're looking at is selected for killing by the
> > OOM killer: while we wait in __get_free_page for more memory, no memory
> > is freed from the selected mm because it cannot reach exit_mmap while
> > we hold that reference.
>
> Right.
>
> sorry for that. I missed this point.
>
>
> > (I may be overstating the case: a little memory may be freed from the
> > exiting task's stack, and kswapd should still be able to pick some pages
> > off the mm. But nonetheless, we would do better to let this mm go.)
> >
> > No doubt there are plenty of other places in /proc which try to
> > allocate memory after taking a reference on an mm; but I think
> > we should be working to eliminate those rather than add to them.
>
> then, Should we change mem_write instead?
I've finished audit other /proc allocation callsite. If my understand
is correct, only pagemap_read() has the same issue.
fixed.
>From 5f83db14a7c62381c4f23994d559041c4c3320a8 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 26 Apr 2011 14:26:52 +0900
Subject: [PATCH] proc: fix pagemap_read() error case
Currently, pagemap_read() has three error and/or corner case
handling mistake.
(1) If ppos parameter is wrong, mm refcount will be leak.
(2) If count parameter is 0, mm refcount will be leak too.
(3) If the current task is sleeping in kmalloc() and the system
is out of memory and oom-killer kill the proc associated task,
mm_refcount prevent the task free its memory. then system may
hang up.
<Quote Hugh's explain why we shold call kmalloc() before get_mm()>
check_mem_permission gets a reference to the mm. If we __get_free_page
after check_mem_permission, imagine what happens if the system is out
of memory, and the mm we're looking at is selected for killing by the
OOM killer: while we wait in __get_free_page for more memory, no memory
is freed from the selected mm because it cannot reach exit_mmap while
we hold that reference.
This patch fixes the above three.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
fs/proc/task_mmu.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 51b9d98..6fb07ce 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -769,18 +769,12 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!task)
goto out;
- mm = mm_for_maps(task);
- ret = PTR_ERR(mm);
- if (!mm || IS_ERR(mm))
- goto out_task;
-
ret = -EINVAL;
/* file position must be aligned */
if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
goto out_task;
ret = 0;
-
if (!count)
goto out_task;
@@ -788,7 +782,12 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
ret = -ENOMEM;
if (!pm.buffer)
- goto out_mm;
+ goto out_task;
+
+ mm = mm_for_maps(task);
+ ret = PTR_ERR(mm);
+ if (!mm || IS_ERR(mm))
+ goto out_free;
pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -831,7 +830,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
len = min(count, PM_ENTRY_BYTES * pm.pos);
if (copy_to_user(buf, pm.buffer, len)) {
ret = -EFAULT;
- goto out_free;
+ goto out_mm;
}
copied += len;
buf += len;
@@ -841,10 +840,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!ret || ret == PM_END_OF_BUFFER)
ret = copied;
-out_free:
- kfree(pm.buffer);
out_mm:
mmput(mm);
+out_free:
+ kfree(pm.buffer);
out_task:
put_task_struct(task);
out:
--
1.7.3.1
On Tue, Apr 26, 2011 at 11:56 AM, Hugh Dickins <[email protected]> wrote:
> On Sun, 17 Apr 2011, [email protected] wrote:
>> From: Jovi Zhang <[email protected]>
>>
>> It should be better if put check_mem_permission before __get_free_page
>> in mem_read, to be same as function mem_write.
>>
>> Signed-off-by: Jovi Zhang <[email protected]>
>
> Sorry to be contrary, but I disagree with this. I'm all for consistency,
> but is there a particular reason why you think the mem_write ordering is
> right and mem_read wrong?
>
> My reason for preferring the current mem_read ordering is this:
>
> check_mem_permission gets a reference to the mm. If we __get_free_page
> after check_mem_permission, imagine what happens if the system is out
> of memory, and the mm we're looking at is selected for killing by the
> OOM killer: while we wait in __get_free_page for more memory, no memory
> is freed from the selected mm because it cannot reach exit_mmap while
> we hold that reference.
>
> (I may be overstating the case: a little memory may be freed from the
> exiting task's stack, and kswapd should still be able to pick some pages
> off the mm. But nonetheless, we would do better to let this mm go.)
>
Indeed, I missed that point. Thanks.
On Tue, Apr 26, 2011 at 02:12:46PM +0900, KOSAKI Motohiro wrote:
> From 74f827ce74e1c4f846905e940edfa5f639b5a2ce Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 26 Apr 2011 13:57:02 +0900
> Subject: [PATCH] [PATCH] proc: put check_mem_permission after __get_free_page in mem_write
>
> It should be better if put check_mem_permission after __get_free_page
> in mem_write, to be same as function mem_read.
>
> Hugh Dickins explained the reason.
>
> check_mem_permission gets a reference to the mm. If we __get_free_page
> after check_mem_permission, imagine what happens if the system is out
> of memory, and the mm we're looking at is selected for killing by the
> OOM killer: while we wait in __get_free_page for more memory, no memory
> is freed from the selected mm because it cannot reach exit_mmap while
> we hold that reference.
>
>
> Reported-by: Jovi Zhang <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Stephen Wilson <[email protected]>
I completely missed the oom case as well. This certainly appears to be
the best solution.
Reviewed-by: Stephen Wilson <[email protected]>
--
steve
On Tue, Apr 26, 2011 at 02:50:16PM +0900, KOSAKI Motohiro wrote:
> I've finished audit other /proc allocation callsite. If my understand
> is correct, only pagemap_read() has the same issue.
I think there is one additional location that might be worth looking at.
We have a kmalloc(struct numa_maps) happening in show_numa_map() (see
mempolicy.c). In this case there is an allocation/free cycle happening
for each vma as we generate the seq_file.
Unfortunately a fix might require a little work. Initial thinking
suggests that we perform a single allocation at numa_maps_open() time.
However, there is an odd layering/dependency issue between mempolicy.c
and task_mmu.c. A "simple clean fix" does not seem obvious to me.
I am very tempted to suggest we move the proc related stuff out of
mempolicy.c. However, see 1a75a6c8.
Thoughts? I can certainly look into this some more if needed.
--
steve
On Tue, 26 Apr 2011, KOSAKI Motohiro wrote:
>
> From 74f827ce74e1c4f846905e940edfa5f639b5a2ce Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 26 Apr 2011 13:57:02 +0900
> Subject: [PATCH] [PATCH] proc: put check_mem_permission after __get_free_page in mem_write
>
> It should be better if put check_mem_permission after __get_free_page
> in mem_write, to be same as function mem_read.
>
> Hugh Dickins explained the reason.
>
> check_mem_permission gets a reference to the mm. If we __get_free_page
> after check_mem_permission, imagine what happens if the system is out
> of memory, and the mm we're looking at is selected for killing by the
> OOM killer: while we wait in __get_free_page for more memory, no memory
> is freed from the selected mm because it cannot reach exit_mmap while
> we hold that reference.
>
>
> Reported-by: Jovi Zhang <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Stephen Wilson <[email protected]>
Thank you, yes!
Acked-by: Hugh Dickins <[email protected]>
> ---
> fs/proc/base.c | 16 +++++++++-------
> 1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4deef2e..e93be6e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -894,20 +894,20 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
> if (!task)
> goto out_no_task;
>
> + copied = -ENOMEM;
> + page = (char *)__get_free_page(GFP_TEMPORARY);
> + if (!page)
> + goto out_task;
> +
> mm = check_mem_permission(task);
> copied = PTR_ERR(mm);
> if (IS_ERR(mm))
> - goto out_task;
> + goto out_free;
>
> copied = -EIO;
> if (file->private_data != (void *)((long)current->self_exec_id))
> goto out_mm;
>
> - copied = -ENOMEM;
> - page = (char *)__get_free_page(GFP_TEMPORARY);
> - if (!page)
> - goto out_mm;
> -
> copied = 0;
> while (count > 0) {
> int this_len, retval;
> @@ -929,9 +929,11 @@ static ssize_t mem_write(struct file * file, const char __user *buf,
> count -= retval;
> }
> *ppos = dst;
> - free_page((unsigned long) page);
> +
> out_mm:
> mmput(mm);
> +out_free:
> + free_page((unsigned long) page);
> out_task:
> put_task_struct(task);
> out_no_task:
> --
> 1.7.3.1
On Tue, 25 Apr 2011, KOSAKI Motohiro wrote:
>
> I've finished audit other /proc allocation callsite. If my understand
> is correct, only pagemap_read() has the same issue.
>
> fixed.
Great, thank you. Though I see Steve has cleverly spotted another:
yes, that should be worth looking into, but I've not studied it.
I confess that when I wrote yesterday, I thought there were a lot more
such instances in fs/proc. I thought /proc/pid/smaps was vulnerable,
for example, but I cannot see that now: maybe I just confused it with
/proc/pid/pagemap.
I was worrying about this a couple of months ago, when David had an
OOM dump with OOM-killed threads failing to exit because they could
not skb_alloc in proc_exit_connector()'s cn_netlink_send() - connector
needs to allocate skb upfront (I'm being vague!), not when thread exits.
But it was a mystery why memory was still unavailable at that late
stage: my unsubstantiated suspicion was that something else was
holding on to the mm-to-be-freed while trying to allocate memory.
>
>
> From 5f83db14a7c62381c4f23994d559041c4c3320a8 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 26 Apr 2011 14:26:52 +0900
> Subject: [PATCH] proc: fix pagemap_read() error case
>
> Currently, pagemap_read() has three error and/or corner case
> handling mistake.
> (1) If ppos parameter is wrong, mm refcount will be leak.
> (2) If count parameter is 0, mm refcount will be leak too.
> (3) If the current task is sleeping in kmalloc() and the system
> is out of memory and oom-killer kill the proc associated task,
> mm_refcount prevent the task free its memory. then system may
> hang up.
>
> <Quote Hugh's explain why we shold call kmalloc() before get_mm()>
> check_mem_permission gets a reference to the mm. If we __get_free_page
> after check_mem_permission, imagine what happens if the system is out
> of memory, and the mm we're looking at is selected for killing by the
> OOM killer: while we wait in __get_free_page for more memory, no memory
> is freed from the selected mm because it cannot reach exit_mmap while
> we hold that reference.
>
> This patch fixes the above three.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
> ---
> fs/proc/task_mmu.c | 19 +++++++++----------
> 1 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 51b9d98..6fb07ce 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -769,18 +769,12 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!task)
> goto out;
>
> - mm = mm_for_maps(task);
> - ret = PTR_ERR(mm);
> - if (!mm || IS_ERR(mm))
> - goto out_task;
> -
> ret = -EINVAL;
> /* file position must be aligned */
> if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
> goto out_task;
>
> ret = 0;
> -
> if (!count)
> goto out_task;
>
> @@ -788,7 +782,12 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
> ret = -ENOMEM;
> if (!pm.buffer)
> - goto out_mm;
> + goto out_task;
> +
> + mm = mm_for_maps(task);
> + ret = PTR_ERR(mm);
> + if (!mm || IS_ERR(mm))
> + goto out_free;
>
> pagemap_walk.pmd_entry = pagemap_pte_range;
> pagemap_walk.pte_hole = pagemap_pte_hole;
> @@ -831,7 +830,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> len = min(count, PM_ENTRY_BYTES * pm.pos);
> if (copy_to_user(buf, pm.buffer, len)) {
> ret = -EFAULT;
> - goto out_free;
> + goto out_mm;
> }
> copied += len;
> buf += len;
> @@ -841,10 +840,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!ret || ret == PM_END_OF_BUFFER)
> ret = copied;
>
> -out_free:
> - kfree(pm.buffer);
> out_mm:
> mmput(mm);
> +out_free:
> + kfree(pm.buffer);
> out_task:
> put_task_struct(task);
> out:
> --
> 1.7.3.1
> Great, thank you. Though I see Steve has cleverly spotted another:
> yes, that should be worth looking into, but I've not studied it.
>
> I confess that when I wrote yesterday, I thought there were a lot more
> such instances in fs/proc. I thought /proc/pid/smaps was vulnerable,
> for example, but I cannot see that now: maybe I just confused it with
> /proc/pid/pagemap.
>
> I was worrying about this a couple of months ago, when David had an
> OOM dump with OOM-killed threads failing to exit because they could
> not skb_alloc in proc_exit_connector()'s cn_netlink_send() - connector
> needs to allocate skb upfront (I'm being vague!), not when thread exits.
>
> But it was a mystery why memory was still unavailable at that late
> stage: my unsubstantiated suspicion was that something else was
> holding on to the mm-to-be-freed while trying to allocate memory.
Hi
Thank you for the explanation.
Today, I revisit get_task_mm() users. So, I found four suspect codes
for google usecase.
o CPUSET
cpuset_change_nodemask() and cpuset_attach() call do_migrate_pages()
under holding mm reference. That said, If you try to attach crazy large
application, you can face the "OOM-killed threads failing to exit" issue.
o /proc/{pid}/cmdline and /proc/{pid}/environ
proc_pid_cmdline() and environ_read() call get_user_pages() under holding
mm reference. That said, If you run system monitor process and it periodically
read the above interface, oom-kill can't gurantte to work.
So, I'll second that Linus said we should remove TIF_MEMDIE and PF_EXITING check
from select_bad_process(). Or at least, We have to make timeout and try to kill
next process if timeout is happen.
And, as Oleg said at past, maybe we have to make MMF_OOM_KILLED and page fault
should detect it.