2010-04-01 05:54:56

by KOSAKI Motohiro

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

> > So Matt, please actually address the _bug_ I pointed out rather than talk
> > about other things. And yes, getting rid of the vma accesses sounds like
> > it would fix it best. If that means that it doesn't work for hugepages, so
> > be it.
>
> That'd actually take us back to where it was when it hit mainline, which
> would make a lot of people unhappy. I wouldn't be one of them as there
> thankfully aren't any huge pages in my world. But I'm convinced
> put_user() must go. In which case, get_user_pages() stays, and I've got
> to switch things to direct physical page access into that array.

No. Direct physical page access for /proc is a completely wrong idea, I think.
Please imagine the case where the caller process is multi-threaded and calls fork():

Example scenario:
1. The parent process has thread-A and thread-B.
2. Thread-A calls read_pagemap.
3. read_pagemap grabs page-C.
4. At the same time, thread-B calls fork(); page-C is now referenced by two processes.
5. Thread-B touches page-C and COW occurs, so the parent process gets the copied page (page-C')
and the child process keeps the original page-C.
6. Thread-A writes to page-C by physical page access, so the child's page is
modified instead of the parent's.

I simply recommend double buffering.

thanks.
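
A minimal user-space sketch of the double-buffering idea, under stated assumptions: entries are gathered into a private buffer while the lock is held, and the copy to the caller's buffer happens only after the lock is dropped, through the caller's own address rather than a pinned physical page. All names here (struct bounce, fill_locked, copy_out, the fake_* lock helpers) are hypothetical; a counter stands in for mmap_sem and memcpy stands in for copy_to_user. This is an illustration, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bounce buffer: pos and len both count entries, not bytes. */
struct bounce {
    uint64_t *buf;
    size_t pos;
    size_t len;
};

/* A counter stands in for mmap_sem so the invariant can be checked. */
static int fake_lock_depth;
static void fake_down_read(void) { fake_lock_depth++; }
static void fake_up_read(void)   { fake_lock_depth--; }

/* Phase 1: collect up to b->len entries while the lock is held.
 * Nothing here touches the destination buffer. */
static size_t fill_locked(struct bounce *b, const uint64_t *src, size_t n)
{
    fake_down_read();
    b->pos = 0;
    while (b->pos < b->len && b->pos < n) {
        b->buf[b->pos] = src[b->pos];
        b->pos++;
    }
    fake_up_read();
    return b->pos;
}

/* Phase 2: copy out after the lock is dropped (memcpy stands in for
 * copy_to_user). Returns the number of bytes copied. */
static size_t copy_out(void *dst, size_t dst_bytes, const struct bounce *b)
{
    size_t bytes = b->pos * sizeof(uint64_t);

    assert(fake_lock_depth == 0);   /* never copy out with the lock held */
    if (bytes > dst_bytes)
        bytes = dst_bytes;
    memcpy(dst, b->buf, bytes);
    return bytes;
}
```

Calling fill_locked() and then copy_out() in a loop gives the chunked read pattern; the destination is only ever written through the address the caller passed in.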

> Even if I fix that, I believe San's original bug can still be triggered
> though, as all the new callers to find_vma are run outside of the
> target's mm_sem. Fixing that should be reasonably straight-forward.



2010-04-01 05:59:07

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Thu, 1 Apr 2010 14:54:43 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > > So Matt, please actually address the _bug_ I pointed out rather than talk
> > > about other things. And yes, getting rid of the vma accesses sounds like
> > > it would fix it best. If that means that it doesn't work for hugepages, so
> > > be it.
> >
> > That'd actually take us back to where it was when it hit mainline, which
> > would make a lot of people unhappy. I wouldn't be one of them as there
> > thankfully aren't any huge pages in my world. But I'm convinced
> > put_user() must go. In which case, get_user_pages() stays, and I've got
> > to switch things to direct physical page access into that array.
>
> No. Direct physical page access for /proc is a completely wrong idea, I think.
> Please imagine the case where the caller process is multi-threaded and calls fork():
>
> Example scenario:
> 1. The parent process has thread-A and thread-B.
> 2. Thread-A calls read_pagemap.
> 3. read_pagemap grabs page-C.
> 4. At the same time, thread-B calls fork(); page-C is now referenced by two processes.
> 5. Thread-B touches page-C and COW occurs, so the parent process gets the copied page (page-C')
> and the child process keeps the original page-C.
> 6. Thread-A writes to page-C by physical page access, so the child's page is
> modified instead of the parent's.
>
> I simply recommend double buffering.
>
Like this?
CC'ed Horiguchi; he touched the hugepage part of this code recently.

==
From: KAMEZAWA Hiroyuki <[email protected]>

In the initial design, walk_page_range() was designed just for walking page tables and
it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
so we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
rid of it to do a sane fix.

Reported-by: San Mehat <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 89 ++++++++++++++++++++++-------------------------------
1 file changed, 38 insertions(+), 51 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m,

memset(&mss, 0, sizeof mss);
mss.vma = vma;
- if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ down_read(&vma->vm_mm->mmap_sem);
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+ up_read(&vma->vm_mm->mmap_sem);
+ }

show_map_vma(m, vma);

@@ -552,7 +555,8 @@ const struct file_operations proc_clear_
};

struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};

#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +579,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+#define PAGEMAP_WALK_SIZE (PMD_SIZE)
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- struct page **pages, *page;
- unsigned long uaddr, uend;
struct mm_struct *mm;
struct pagemapread pm;
- int pagecount;
int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
+ int copied = 0;

if (!task)
goto out;
@@ -757,34 +758,10 @@ static ssize_t pagemap_read(struct file
if (!mm)
goto out_task;

-
- uaddr = (unsigned long)buf & PAGE_MASK;
- uend = (unsigned long)(buf + count);
- pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
- ret = 0;
- if (pagecount == 0)
+ pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+ pm.buffer = kmalloc(pm.len, GFP_KERNEL);
+ if (!pm.buffer)
goto out_mm;
- pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
- ret = -ENOMEM;
- if (!pages)
- goto out_mm;
-
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, uaddr, pagecount,
- 1, 0, pages, NULL);
- up_read(&current->mm->mmap_sem);
-
- if (ret < 0)
- goto out_free;
-
- if (ret != pagecount) {
- pagecount = ret;
- ret = -EFAULT;
- goto out_pages;
- }
-
- pm.out = (u64 __user *)buf;
- pm.end = (u64 __user *)(buf + count);

pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -807,23 +784,33 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
- ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
- if (ret == PM_END_OF_BUFFER)
- ret = 0;
- /* don't need mmap_sem for these, but this looks cleaner */
- *ppos += (char __user *)pm.out - buf;
- if (!ret)
- ret = (char __user *)pm.out - buf;
-
-out_pages:
- for (; pagecount; pagecount--) {
- page = pages[pagecount-1];
- if (!PageReserved(page))
- SetPageDirty(page);
- page_cache_release(page);
+ while (count && (start_vaddr < end_vaddr)) {
+ int len;
+ unsigned long end;
+ pm.pos = 0;
+ start_vaddr += PAGEMAP_WALK_SIZE;
+ end = start_vaddr + PAGEMAP_WALK_SIZE;
+ down_read(&mm->mmap_sem);
+ ret = walk_page_range(start_vaddr, end, &pagemap_walk);
+ up_read(&mm->mmap_sem);
+
+ len = PM_ENTRY_BYTES * pm.pos;
+ if (len > count)
+ len = count;
+ if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+ copied += len;
+ buf += len;
+ count -= len;
}
+ *ppos += copied;
+ if (!ret || ret == PM_END_OF_BUFFER)
+ ret = copied;
+
out_free:
- kfree(pages);
+ kfree(pm.buffer);
out_mm:
mmput(mm);
out_task:

2010-04-01 06:05:48

by KOSAKI Motohiro

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

> > I simply recommend double buffering.
> >
> Like this?
> CC'ed Horiguchi; he touched the hugepage part of this code recently.

I like this patch, but I have a few comments.

>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> In the initial design, walk_page_range() was designed just for walking page tables and
> it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
> so we need mmap_sem around it.
>
> This patch adds mmap_sem around walk_page_range().
>
> Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
> rid of it to do a sane fix.
>
> Reported-by: San Mehat <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> fs/proc/task_mmu.c | 89 ++++++++++++++++++++++-------------------------------
> 1 file changed, 38 insertions(+), 51 deletions(-)
>
> Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
> ===================================================================
> --- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
> +++ linux-2.6.34-rc3/fs/proc/task_mmu.c
> @@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m,
>
> memset(&mss, 0, sizeof mss);
> mss.vma = vma;
> - if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
> + down_read(&vma->vm_mm->mmap_sem);
> walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
> + up_read(&vma->vm_mm->mmap_sem);
> + }

m_start already takes mmap_sem, so I don't think this part is necessary.


>
> show_map_vma(m, vma);
>
> @@ -552,7 +555,8 @@ const struct file_operations proc_clear_
> };
>
> struct pagemapread {
> - u64 __user *out, *end;
> + int pos, len;
> + u64 *buffer;
> };
>
> #define PM_ENTRY_BYTES sizeof(u64)
> @@ -575,10 +579,8 @@ struct pagemapread {
> static int add_to_pagemap(unsigned long addr, u64 pfn,
> struct pagemapread *pm)
> {
> - if (put_user(pfn, pm->out))
> - return -EFAULT;
> - pm->out++;
> - if (pm->out >= pm->end)
> + pm->buffer[pm->pos++] = pfn;
> + if (pm->pos >= pm->len)
> return PM_END_OF_BUFFER;
> return 0;
> }
> @@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t *
> * determine which areas of memory are actually mapped and llseek to
> * skip over unmapped regions.
> */
> +#define PAGEMAP_WALK_SIZE (PMD_SIZE)
> static ssize_t pagemap_read(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> {
> struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> - struct page **pages, *page;
> - unsigned long uaddr, uend;
> struct mm_struct *mm;
> struct pagemapread pm;
> - int pagecount;
> int ret = -ESRCH;
> struct mm_walk pagemap_walk = {};
> unsigned long src;
> unsigned long svpfn;
> unsigned long start_vaddr;
> unsigned long end_vaddr;
> + int copied = 0;
>
> if (!task)
> goto out;
> @@ -757,34 +758,10 @@ static ssize_t pagemap_read(struct file
> if (!mm)
> goto out_task;
>
> -
> - uaddr = (unsigned long)buf & PAGE_MASK;
> - uend = (unsigned long)(buf + count);
> - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
> - ret = 0;
> - if (pagecount == 0)
> + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> + pm.buffer = kmalloc(pm.len, GFP_KERNEL);

In the /proc interface, GFP_TEMPORARY is recommended rather than GFP_KERNEL.
It helps to prevent unnecessary external fragmentation.


> + if (!pm.buffer)
> goto out_mm;
> - pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
> - ret = -ENOMEM;
> - if (!pages)
> - goto out_mm;
> -
> - down_read(&current->mm->mmap_sem);
> - ret = get_user_pages(current, current->mm, uaddr, pagecount,
> - 1, 0, pages, NULL);
> - up_read(&current->mm->mmap_sem);
> -
> - if (ret < 0)
> - goto out_free;
> -
> - if (ret != pagecount) {
> - pagecount = ret;
> - ret = -EFAULT;
> - goto out_pages;
> - }
> -
> - pm.out = (u64 __user *)buf;
> - pm.end = (u64 __user *)(buf + count);
>
> pagemap_walk.pmd_entry = pagemap_pte_range;
> pagemap_walk.pte_hole = pagemap_pte_hole;
> @@ -807,23 +784,33 @@ static ssize_t pagemap_read(struct file
> * user buffer is tracked in "pm", and the walk
> * will stop when we hit the end of the buffer.
> */
> - ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
> - if (ret == PM_END_OF_BUFFER)
> - ret = 0;
> - /* don't need mmap_sem for these, but this looks cleaner */
> - *ppos += (char __user *)pm.out - buf;
> - if (!ret)
> - ret = (char __user *)pm.out - buf;
> -
> -out_pages:
> - for (; pagecount; pagecount--) {
> - page = pages[pagecount-1];
> - if (!PageReserved(page))
> - SetPageDirty(page);
> - page_cache_release(page);
> + while (count && (start_vaddr < end_vaddr)) {
> + int len;
> + unsigned long end;
> + pm.pos = 0;
> + start_vaddr += PAGEMAP_WALK_SIZE;
> + end = start_vaddr + PAGEMAP_WALK_SIZE;
> + down_read(&mm->mmap_sem);
> + ret = walk_page_range(start_vaddr, end, &pagemap_walk);
> + up_read(&mm->mmap_sem);
> +
> + len = PM_ENTRY_BYTES * pm.pos;
> + if (len > count)
> + len = count;
> + if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) {
> + ret = -EFAULT;
> + goto out_free;
> + }
> + copied += len;
> + buf += len;
> + count -= len;
> }
> + *ppos += copied;
> + if (!ret || ret == PM_END_OF_BUFFER)
> + ret = copied;
> +
> out_free:
> - kfree(pages);
> + kfree(pm.buffer);
> out_mm:
> mmput(mm);
> out_task:
>


2010-04-01 06:13:55

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Thu, 1 Apr 2010 15:05:38 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:
> > memset(&mss, 0, sizeof mss);
> > mss.vma = vma;
> > - if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> > + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
> > + down_read(&vma->vm_mm->mmap_sem);
> > walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
> > + up_read(&vma->vm_mm->mmap_sem);
> > + }
>
> m_start already takes mmap_sem, so I don't think this part is necessary.
>
Right.
<snip>


> > -
> > - uaddr = (unsigned long)buf & PAGE_MASK;
> > - uend = (unsigned long)(buf + count);
> > - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
> > - ret = 0;
> > - if (pagecount == 0)
> > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> > + pm.buffer = kmalloc(pm.len, GFP_KERNEL);
>
> In the /proc interface, GFP_TEMPORARY is recommended rather than GFP_KERNEL.
> It helps to prevent unnecessary external fragmentation.
>
Ok.

From: KAMEZAWA Hiroyuki <[email protected]>

In the initial design, walk_page_range() was designed just for walking page tables and
it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
so we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
rid of it to do a sane fix.

Changelog:
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 86 ++++++++++++++++++++++-------------------------------
1 file changed, 36 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,

memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);

@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};

struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};

#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+#define PAGEMAP_WALK_SIZE (PMD_SIZE)
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- struct page **pages, *page;
- unsigned long uaddr, uend;
struct mm_struct *mm;
struct pagemapread pm;
- int pagecount;
int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
+ int copied = 0;

if (!task)
goto out;
@@ -757,34 +756,10 @@ static ssize_t pagemap_read(struct file
if (!mm)
goto out_task;

-
- uaddr = (unsigned long)buf & PAGE_MASK;
- uend = (unsigned long)(buf + count);
- pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
- ret = 0;
- if (pagecount == 0)
+ pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+ pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
+ if (!pm.buffer)
goto out_mm;
- pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
- ret = -ENOMEM;
- if (!pages)
- goto out_mm;
-
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, uaddr, pagecount,
- 1, 0, pages, NULL);
- up_read(&current->mm->mmap_sem);
-
- if (ret < 0)
- goto out_free;
-
- if (ret != pagecount) {
- pagecount = ret;
- ret = -EFAULT;
- goto out_pages;
- }
-
- pm.out = (u64 __user *)buf;
- pm.end = (u64 __user *)(buf + count);

pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -807,23 +782,34 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
- ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
- if (ret == PM_END_OF_BUFFER)
- ret = 0;
- /* don't need mmap_sem for these, but this looks cleaner */
- *ppos += (char __user *)pm.out - buf;
- if (!ret)
- ret = (char __user *)pm.out - buf;
-
-out_pages:
- for (; pagecount; pagecount--) {
- page = pages[pagecount-1];
- if (!PageReserved(page))
- SetPageDirty(page);
- page_cache_release(page);
+ while (count && (start_vaddr < end_vaddr)) {
+ int len;
+ unsigned long end;
+
+ pm.pos = 0;
+ start_vaddr += PAGEMAP_WALK_SIZE;
+ end = start_vaddr + PAGEMAP_WALK_SIZE;
+ down_read(&mm->mmap_sem);
+ ret = walk_page_range(start_vaddr, end, &pagemap_walk);
+ up_read(&mm->mmap_sem);
+
+ len = PM_ENTRY_BYTES * pm.pos;
+ if (len > count)
+ len = count;
+ if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+ copied += len;
+ buf += len;
+ count -= len;
}
+ *ppos += copied;
+ if (!ret || ret == PM_END_OF_BUFFER)
+ ret = copied;
+
out_free:
- kfree(pages);
+ kfree(pm.buffer);
out_mm:
mmput(mm);
out_task:

2010-04-01 06:38:24

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Thu, 1 Apr 2010 15:09:56 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> + pm.pos = 0;
> + start_vaddr += PAGEMAP_WALK_SIZE;
> + end = start_vaddr + PAGEMAP_WALK_SIZE;

Sigh...this is bad..

==
From: KAMEZAWA Hiroyuki <[email protected]>

In the initial design, walk_page_range() was designed just for walking page tables and
it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
so we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
rid of it to do a sane fix.

Changelog:
- fixed start_vaddr calculation
- removed unnecessary cast.
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL
- use min().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 84 +++++++++++++++++++++--------------------------------
1 file changed, 34 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,

memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);

@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};

struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};

#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+#define PAGEMAP_WALK_SIZE (PMD_SIZE)
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- struct page **pages, *page;
- unsigned long uaddr, uend;
struct mm_struct *mm;
struct pagemapread pm;
- int pagecount;
int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
+ int copied = 0;

if (!task)
goto out;
@@ -757,35 +756,11 @@ static ssize_t pagemap_read(struct file
if (!mm)
goto out_task;

-
- uaddr = (unsigned long)buf & PAGE_MASK;
- uend = (unsigned long)(buf + count);
- pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
- ret = 0;
- if (pagecount == 0)
- goto out_mm;
- pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
- ret = -ENOMEM;
- if (!pages)
+ pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+ pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
+ if (!pm.buffer)
goto out_mm;

- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, uaddr, pagecount,
- 1, 0, pages, NULL);
- up_read(&current->mm->mmap_sem);
-
- if (ret < 0)
- goto out_free;
-
- if (ret != pagecount) {
- pagecount = ret;
- ret = -EFAULT;
- goto out_pages;
- }
-
- pm.out = (u64 __user *)buf;
- pm.end = (u64 __user *)(buf + count);
-
pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
pagemap_walk.hugetlb_entry = pagemap_hugetlb_range;
@@ -807,23 +782,32 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
- ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
- if (ret == PM_END_OF_BUFFER)
- ret = 0;
- /* don't need mmap_sem for these, but this looks cleaner */
- *ppos += (char __user *)pm.out - buf;
- if (!ret)
- ret = (char __user *)pm.out - buf;
-
-out_pages:
- for (; pagecount; pagecount--) {
- page = pages[pagecount-1];
- if (!PageReserved(page))
- SetPageDirty(page);
- page_cache_release(page);
+ while (count && (start_vaddr < end_vaddr)) {
+ int len;
+ unsigned long end;
+
+ pm.pos = 0;
+ end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr);
+ down_read(&mm->mmap_sem);
+ ret = walk_page_range(start_vaddr, end, &pagemap_walk);
+ up_read(&mm->mmap_sem);
+ start_vaddr += PAGEMAP_WALK_SIZE;
+
+ len = min(count, PM_ENTRY_BYTES * pm.pos);
+ if (copy_to_user(buf, pm.buffer, len) < 0) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+ copied += len;
+ buf += len;
+ count -= len;
}
+ *ppos += copied;
+ if (!ret || ret == PM_END_OF_BUFFER)
+ ret = copied;
+
out_free:
- kfree(pages);
+ kfree(pm.buffer);
out_mm:
mmput(mm);
out_task:

2010-04-01 07:09:41

by Matt Mackall

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Thu, 2010-04-01 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Apr 2010 15:09:56 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> >
> > + pm.pos = 0;
> > + start_vaddr += PAGEMAP_WALK_SIZE;
> > + end = start_vaddr + PAGEMAP_WALK_SIZE;
>
> Sigh...this is bad..
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> In the initial design, walk_page_range() was designed just for walking page tables and
> it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
> so we need mmap_sem around it.

This looks pretty reasonable. However, it also looks very similar to my
first version of pagemap (which started with double-buffering). It's
going to need re-testing to make sure it hasn't reintroduced any
wrapping, alignment, or off-by-one bugs that have already been ironed
out once or twice.
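
A user-space check for the alignment and off-by-one cases mentioned above has to map a virtual address to its position in the pagemap file and decode the entry it reads back. The helpers below are a hypothetical sketch of that arithmetic only (no file I/O). They assume one 8-byte entry per virtual page, matching PM_ENTRY_BYTES in the patch, and the bit layout as I understand it from the kernel's pagemap documentation (bit 63 = present, bits 0-54 = PFN); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PM_ENTRY_BYTES 8ULL               /* one u64 entry per virtual page */
#define PM_PRESENT     (1ULL << 63)       /* assumed: bit 63 = page present */
#define PM_PFN_MASK    ((1ULL << 55) - 1) /* assumed: bits 0-54 hold the PFN */

/* File offset of the entry that describes vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size > 0);
    return (vaddr / page_size) * PM_ENTRY_BYTES;
}

/* Non-zero if the entry says the page is present in RAM. */
static int pagemap_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

/* PFN field of a present entry, or 0 if the page is not present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
    return pagemap_present(entry) ? (entry & PM_PFN_MASK) : 0;
}
```

A test program would pread() PM_ENTRY_BYTES bytes at pagemap_offset(addr, page_size) and compare reads that start at, just before, and just after a chunk boundary.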

--
http://selenic.com : development and support for Mercurial and Linux

2010-04-01 07:21:33

by KOSAKI Motohiro

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

> On Thu, 2010-04-01 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 1 Apr 2010 15:09:56 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > >
> > > + pm.pos = 0;
> > > + start_vaddr += PAGEMAP_WALK_SIZE;
> > > + end = start_vaddr + PAGEMAP_WALK_SIZE;
> >
> > Sigh...this is bad..
> >
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > In the initial design, walk_page_range() was designed just for walking page tables and
> > it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
> > so we need mmap_sem around it.
>
> This looks pretty reasonable. However, it also looks very similar to my
> first version of pagemap (which started with double-buffering). It's
> going to need re-testing to make sure it hasn't reintroduced any
> wrapping, alignment, or off-by-one bugs that have already been ironed
> out once or twice.

OK, I'm looking forward to your test results. :)

Thanks, Matt, for your contribution!

2010-04-01 15:15:40

by Linus Torvalds

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk



On Thu, 1 Apr 2010, KAMEZAWA Hiroyuki wrote:
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> In the initial design, walk_page_range() was designed just for walking page tables and
> it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
> so we need mmap_sem around it.
>
> This patch adds mmap_sem around walk_page_range().
>
> Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
> rid of it to do a sane fix.
>
> Changelog:
> - fixed start_vaddr calculation
> - removed unnecessary cast.
> - removed unnecessary change in smaps.
> - use GFP_TEMPORARY instead of GFP_KERNEL
> - use min().

Looks mostly correct to me (but just looking at the source, no testing,
obviously). And I like how the double buffering removes more lines of code
than it adds.

However, I think there is a subtle problem with this:

> + while (count && (start_vaddr < end_vaddr)) {
> + int len;
> + unsigned long end;
> +
> + pm.pos = 0;
> + end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr);
> + down_read(&mm->mmap_sem);
> + ret = walk_page_range(start_vaddr, end, &pagemap_walk);
> + up_read(&mm->mmap_sem);
> + start_vaddr += PAGEMAP_WALK_SIZE;

I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end'
ends up being odd. You'll never notice on architectures where the user
space doesn't go all the way up to the end (walk_page_range will return 0
etc), but it will do the wrong thing if 'start' is close to the end, end
is _at_ the end, and you'll not be able to read that range (because of the
overflow).

So I do think you should do something like

end = start_vaddr + PAGEMAP_WALK_SIZE;
/* overflow? or final chunk? */
if (end < start_vaddr || end > end_vaddr)
end = end_vaddr;

instead of using 'min()'.

(This only matters if TASK_SIZE_OF() can be ~0ul, but I think that can
happen on sparc, for example)

Linus
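
The difference between the two forms can be reproduced in plain user-space C. This is a sketch with hypothetical function names (chunk_end_min, chunk_end_checked); it relies only on the fact that unsigned arithmetic wraps in C, and it uses ULONG_MAX as the limit to model an address space that runs all the way to the top.

```c
#include <limits.h>

/* min()-style version: the sum can wrap past ULONG_MAX before the
 * comparison, so the result may be smaller than start. */
static unsigned long chunk_end_min(unsigned long start, unsigned long step,
                                   unsigned long limit)
{
    unsigned long end = start + step;   /* may wrap to a small value */

    return end < limit ? end : limit;
}

/* Checked version, following the suggestion above. */
static unsigned long chunk_end_checked(unsigned long start, unsigned long step,
                                       unsigned long limit)
{
    unsigned long end = start + step;

    /* overflow? or final chunk? */
    if (end < start || end > limit)
        end = limit;
    return end;
}
```

With start close to ULONG_MAX, chunk_end_min() returns an end below start, so the final range is never walked; chunk_end_checked() clamps to the limit instead.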

2010-04-02 00:15:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Thu, 1 Apr 2010 08:10:40 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
> > + while (count && (start_vaddr < end_vaddr)) {
> > + int len;
> > + unsigned long end;
> > +
> > + pm.pos = 0;
> > + end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr);
> > + down_read(&mm->mmap_sem);
> > + ret = walk_page_range(start_vaddr, end, &pagemap_walk);
> > + up_read(&mm->mmap_sem);
> > + start_vaddr += PAGEMAP_WALK_SIZE;
>
> I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end'
> ends up being odd. You'll never notice on architectures where the user
> space doesn't go all the way up to the end (walk_page_range will return 0
> etc), but it will do the wrong thing if 'start' is close to the end, end
> is _at_ the end, and you'll not be able to read that range (because of the
> overflow).
>
I didn't notice that. Thanks.

> So I do think you should do something like
>
> end = start_vaddr + PAGEMAP_WALK_SIZE;
> /* overflow? or final chunk? */
> if (end < start_vaddr || end > end_vaddr)
> end = end_vaddr;
>
> instead of using 'min()'.
>
OK, here it is. Now:
end = start_vaddr + PAGEMAP_WALK_SIZE;
if (end < start_vaddr || end > end_vaddr)
end = end_vaddr;
....walk....
start_vaddr = end;

Only tested on x86-64.
Thanks,
-Kame
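
The loop shape outlined above can be exercised on its own. The sketch below is hypothetical user-space code (walk_chunks is a made-up name; a byte counter stands in for the page walk): it clamps the chunk end, then advances start to end, so the loop terminates and covers the whole range even when the limit is ULONG_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Walk [start, limit) in step-sized chunks. Returns the number of chunks
 * and stores the total number of bytes covered in *covered. */
static unsigned long walk_chunks(unsigned long start, unsigned long limit,
                                 unsigned long step, unsigned long *covered)
{
    unsigned long chunks = 0;

    assert(step > 0);   /* a zero step would never make progress */
    *covered = 0;
    while (start < limit) {
        unsigned long end = start + step;

        /* overflow? or final chunk? */
        if (end < start || end > limit)
            end = limit;
        *covered += end - start;  /* stand-in for walking [start, end) */
        start = end;              /* advance by what was actually walked */
        chunks++;
    }
    return chunks;
}
```

Setting start to end, rather than adding the step again, keeps the loop condition free of a second possible wrap.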

==
From: KAMEZAWA Hiroyuki <[email protected]>

In the initial design, walk_page_range() was designed just for walking page tables and
it didn't require mmap_sem. Now find_vma() etc. are used in walk_page_range(),
so we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because the /proc/<pid>/pagemap callback routine uses put_user(), we have to get
rid of it to do a sane fix.

Changelog: 2010/Apr/2
- fixed start_vaddr and end overflow
Changelog: 2010/Apr/1
- fixed start_vaddr calculation
- removed unnecessary cast.
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 87 ++++++++++++++++++++++-------------------------------
1 file changed, 37 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,

memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);

@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};

struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};

#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+#define PAGEMAP_WALK_SIZE (PMD_SIZE)
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- struct page **pages, *page;
- unsigned long uaddr, uend;
struct mm_struct *mm;
struct pagemapread pm;
- int pagecount;
int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
+ int copied = 0;

if (!task)
goto out;
@@ -757,34 +756,10 @@ static ssize_t pagemap_read(struct file
if (!mm)
goto out_task;

-
- uaddr = (unsigned long)buf & PAGE_MASK;
- uend = (unsigned long)(buf + count);
- pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
- ret = 0;
- if (pagecount == 0)
+ pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+ pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
+ if (!pm.buffer)
goto out_mm;
- pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
- ret = -ENOMEM;
- if (!pages)
- goto out_mm;
-
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, uaddr, pagecount,
- 1, 0, pages, NULL);
- up_read(&current->mm->mmap_sem);
-
- if (ret < 0)
- goto out_free;
-
- if (ret != pagecount) {
- pagecount = ret;
- ret = -EFAULT;
- goto out_pages;
- }
-
- pm.out = (u64 __user *)buf;
- pm.end = (u64 __user *)(buf + count);

pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -807,23 +782,35 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
- ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
- if (ret == PM_END_OF_BUFFER)
- ret = 0;
- /* don't need mmap_sem for these, but this looks cleaner */
- *ppos += (char __user *)pm.out - buf;
- if (!ret)
- ret = (char __user *)pm.out - buf;
-
-out_pages:
- for (; pagecount; pagecount--) {
- page = pages[pagecount-1];
- if (!PageReserved(page))
- SetPageDirty(page);
- page_cache_release(page);
+ while (count && (start_vaddr < end_vaddr)) {
+ int len;
+ unsigned long end;
+
+ pm.pos = 0;
+ end = start_vaddr + PAGEMAP_WALK_SIZE;
+ /* overflow ? */
+ if (end < start_vaddr || end > end_vaddr)
+ end = end_vaddr;
+ down_read(&mm->mmap_sem);
+ ret = walk_page_range(start_vaddr, end, &pagemap_walk);
+ up_read(&mm->mmap_sem);
+ start_vaddr = end;
+
+ len = min(count, PM_ENTRY_BYTES * pm.pos);
+ if (copy_to_user(buf, pm.buffer, len) < 0) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+ copied += len;
+ buf += len;
+ count -= len;
}
+ *ppos += copied;
+ if (!ret || ret == PM_END_OF_BUFFER)
+ ret = copied;
+
out_free:
- kfree(pages);
+ kfree(pm.buffer);
out_mm:
mmput(mm);
out_task:



2010-04-02 14:31:14

by Matt Mackall

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Fri, Apr 02, 2010 at 09:11:29AM +0900, KAMEZAWA Hiroyuki wrote:

> int ret = -ESRCH;
...
> + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
> + if (!pm.buffer)
> goto out_mm;
...
> out_mm:
> mmput(mm);

Looks like this gets the wrong return code?
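
To illustrate the concern: with the goto-cleanup style, ret still holds whatever the previous step left in it when the allocation fails. A minimal user-space sketch follows, with hypothetical names (read_with_buffer, alloc_fn, failing_alloc). It assumes -ENOMEM is the intended code, as in the kcalloc() path the patch removes, and takes the allocator as a parameter only so the failure path can be exercised.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hook so the failure path can be tested. */
typedef void *(*alloc_fn)(size_t);

static int read_with_buffer(int have_task, size_t len, alloc_fn alloc)
{
    int ret = -ESRCH;       /* default for the lookup steps */
    char *buffer;

    if (!have_task)
        goto out;

    ret = -ENOMEM;          /* set before the allocation can fail */
    buffer = alloc(len);
    if (!buffer)
        goto out;

    memset(buffer, 0, len); /* placeholder for the real work */
    ret = 0;
    free(buffer);
out:
    return ret;
}

/* Test helper: an allocator that always fails. */
static void *failing_alloc(size_t len)
{
    (void)len;
    return NULL;
}
```

Without the `ret = -ENOMEM;` line, an allocation failure would report the stale value left by the earlier steps.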

--
Mathematics is the supreme nostalgia of our time.

2010-04-06 06:52:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

On Fri, 2 Apr 2010 09:30:58 -0500
Matt Mackall <[email protected]> wrote:

> On Fri, Apr 02, 2010 at 09:11:29AM +0900, KAMEZAWA Hiroyuki wrote:
>
> > int ret = -ESRCH;
> ...
> > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> > + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
> > + if (!pm.buffer)
> > goto out_mm;
> ...
> > out_mm:
> > mmput(mm);
>
> Looks like this gets the wrong return code?
>
I'm sorry, and thank you for pointing it out.
I confirmed that the merged version has the fixed code.

Regards,
-Kame