2004-11-17 02:12:09

by David Miller

[permalink] [raw]
Subject: loops in get_user_pages() for VM_IO


Some time recently, I don't know when, the logic in get_user_pages()
appears to have been changed a bit.

The inner-most loop of this routine basically does:

while (follow_page() returns NULL)
try handle_mm_fault();

The problem with this is that for !pfn_valid()
(for example, VM_IO areas mapping to I/O physical
address which have no assosciated page struct) follow_page()
will _always_ return NULL.

So when X tries to mmap() the frame buffer on my
system it loops forever here now.

Is pfn_valid() supposed to return true for I/O areas
too? If so, where does the struct page backing store
come from?

It could be argued that setting VM_LOCKED is invalid.
And not setting it in drivers/video/sbuslib.c would make
this hang go away, but the above analysis means that
make_pages_present() cannot work on VM_IO areas.


2004-11-17 02:25:41

by David Miller

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

On Tue, 16 Nov 2004 17:53:28 -0800
"David S. Miller" <[email protected]> wrote:

> Some time recently, I don't know when, the logic in get_user_pages()
> appears to have been changed a bit.

Replying to myself, this hang started when do_mmap_pgoff() was
changed to re-read the vma->vm_flags bits right before we
vma_link() the vma into the mm.

Previously, only the mmap() call specified vm_flags would
be used, but now we also get whatever bit changes to
vma->vm_flags are done by the file->f_op->mmap() method
as well.

In any event, it is still an open question whether get_user_pages()
and thus make_pages_present() is meant to be able to handle
VM_IO areas.

2004-11-17 06:34:21

by Andrew Morton

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

"David S. Miller" <[email protected]> wrote:
>
> In any event, it is still an open question whether get_user_pages()
> and thus make_pages_present() is meant to be able to handle
> VM_IO areas.

It doesn't make a lot of sense. Andrea says that the only caller of
get_user_pages() which uses a null `pages' arg is mlock(), and mlock of a
VM_IO region is currently causing hangs, and proposes this change:


--- sles/mm/memory.c.~1~ 2004-11-12 12:30:25.000000000 +0100
+++ sles/mm/memory.c 2004-11-16 17:58:02.752131952 +0100
@@ -753,7 +753,7 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (pages && (vma->vm_flags & VM_IO))
+ if (!vma || (vma->vm_flags & VM_IO)
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

which should fix up the sbuslib.c problem.

Although I suspect this change will make mlockall() return -EFAULT or a
short result to userspace if the caller had a VM_IO region mapped, which
doesn't seem appropriate. So perhaps we should silently bale out in the
VM_IO case.

Or, better, simply advance over the VM_IO vma and onto the next one?


--- 25/mm/memory.c~get_user_pages-skip-VM_IO 2004-11-16 22:24:34.470017896 -0800
+++ 25-akpm/mm/memory.c 2004-11-16 22:32:04.890543568 -0800
@@ -761,9 +761,27 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (pages && (vma->vm_flags & VM_IO))
- || !(flags & vma->vm_flags))
- return i ? : -EFAULT;
+ if (!vma || !(flags & vma->vm_flags))
+ return i ? i : -EFAULT;
+
+ if (vma->vm_flags & VM_IO) {
+ if (pages) {
+ /*
+ * No, you cannot gather pageframes from VM_IO
+ * regions
+ */
+ return i ? i : -EFAULT;
+ }
+ /*
+ * OK, someone is simply trying to fault in some pages
+ * and they encountered a VM_IO region. mlockall()
+ * can do this. Simply skip the vma
+ */
+ start = vma->vm_end;
+ len -= (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ i += (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ continue;
+ }

if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
_

(I've probably screwed something up there.)

2004-11-17 06:41:52

by David Miller

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

On Tue, 16 Nov 2004 22:33:38 -0800
Andrew Morton <[email protected]> wrote:

> --- sles/mm/memory.c.~1~ 2004-11-12 12:30:25.000000000 +0100
> +++ sles/mm/memory.c 2004-11-16 17:58:02.752131952 +0100
> @@ -753,7 +753,7 @@ int get_user_pages(struct task_struct *t
> continue;
> }
>
> - if (!vma || (pages && (vma->vm_flags & VM_IO))
> + if (!vma || (vma->vm_flags & VM_IO)
> || !(flags & vma->vm_flags))
> return i ? : -EFAULT;
>
> which should fix up the sbuslib.c problem.

It would only becuase do_mmap_pgoff() doesn't check
the return value of make_pages_present().

> Or, better, simply advance over the VM_IO vma and onto the next one?

That would work too.

Although currently in my sparc64 tree I'm setting VM_RESERVED instead
of VM_LOCKED for I/O mappings and that solves the issue as well. It
would not, of course, solve the mlock() case you mentioned.

I think we need to set these bits consistently across the tree.
And remap_pfn_range() is a good model, so that's what I've used.
Parts of the x86 tree trigger this case too btw, for example
the pci mmap support code. In fact, all the pci mmap support
routines set VM_LOCKED, probably because they were copied over
from the first two implementations (sparc64 and i386) :-)

2004-11-17 13:44:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

On Tue, Nov 16, 2004 at 10:33:38PM -0800, Andrew Morton wrote:
> short result to userspace if the caller had a VM_IO region mapped, which
> doesn't seem appropriate. So perhaps we should silently bale out in the

why don't we simply make make_pages_present void, nobody but mlock.c is
checking its retval anyways despite many more are calling it? Somebody
else may still want to be notified by get_user_pages about the -EFAULT
on non-ram. It's really make_pages_protected and not get_user_pages that
doesn't want to ever return failure (in most cases nobody checks the
reval of make_pages_protected in the first place). So if we've to hide
the error, I'd prefer to hide it in the higher layer. I even noticed the
kernel tree I'm working with has another hack in mlock.c just to hide
_more_ errors returned by make_pages_present, see what a fraction of my
patch looks like now:

@@ -1835,11 +1835,8 @@ int make_pages_present(unsigned long add
if (end > vma->vm_end)
BUG();
len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
- ret = get_user_pages(current, current->mm, addr,
- len, write, 0, NULL, NULL);
- if (ret < 0)
- return ret;
- return ret == len ? 0 : -1;
+ get_user_pages(current, current->mm, addr,
+ len, write, 0, NULL, NULL);
}

/*

this would be the full port against mainline:

Signed-off-by: Andrea Arcangeli <[email protected]>

Index: linux-2.5/include/linux/mm.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/mm.h,v
retrieving revision 1.195
diff -u -p -r1.195 mm.h
--- linux-2.5/include/linux/mm.h 14 Nov 2004 04:35:49 -0000 1.195
+++ linux-2.5/include/linux/mm.h 17 Nov 2004 13:42:02 -0000
@@ -587,7 +587,7 @@ extern pte_t *FASTCALL(pte_alloc_map(str
extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
-extern int make_pages_present(unsigned long addr, unsigned long end);
+extern void make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
void install_arg_page(struct vm_area_struct *, struct page *, unsigned long);

Index: linux-2.5/mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.191
diff -u -p -r1.191 memory.c
--- linux-2.5/mm/memory.c 16 Nov 2004 03:53:40 -0000 1.191
+++ linux-2.5/mm/memory.c 17 Nov 2004 13:42:02 -0000
@@ -761,7 +761,7 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (pages && (vma->vm_flags & VM_IO))
+ if (!vma || (vma->vm_flags & VM_IO)
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

@@ -1754,7 +1754,7 @@ out:
return pmd_offset(pgd, address);
}

-int make_pages_present(unsigned long addr, unsigned long end)
+void make_pages_present(unsigned long addr, unsigned long end)
{
int ret, len, write;
struct vm_area_struct * vma;
@@ -1768,11 +1768,8 @@ int make_pages_present(unsigned long add
if (end > vma->vm_end)
BUG();
len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
- ret = get_user_pages(current, current->mm, addr,
- len, write, 0, NULL, NULL);
- if (ret < 0)
- return ret;
- return ret == len ? 0 : -1;
+ get_user_pages(current, current->mm, addr,
+ len, write, 0, NULL, NULL);
}

/*
Index: linux-2.5/mm/mlock.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/mlock.c,v
retrieving revision 1.16
diff -u -p -r1.16 mlock.c
--- linux-2.5/mm/mlock.c 19 Oct 2004 05:54:02 -0000 1.16
+++ linux-2.5/mm/mlock.c 17 Nov 2004 13:42:26 -0000
@@ -47,7 +47,7 @@ static int mlock_fixup(struct vm_area_st
pages = (end - start) >> PAGE_SHIFT;
if (newflags & VM_LOCKED) {
pages = -pages;
- ret = make_pages_present(start, end);
+ make_pages_present(start, end);
}

vma->vm_mm->locked_vm -= pages;

2004-11-17 16:45:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

On Wed, Nov 17, 2004 at 02:44:08PM +0100, Andrea Arcangeli wrote:
> this would be the full port against mainline:

while it's not necessary with mlockall (since we walk each vma in
mlock.c before calling make_pages_present), I've to agree having a more
generic implementation of get_user_pages that doesn't stop on VM_IO with
pages = NULL, sounds a nicer API (even if nobody uses that
functionality). Still I'd leave make_pages_present void. Andrew's patch
had two bugs: it didn't check for "len" underflow and it used vm_start
instead of start to calculated the current page walk under VM_IO, plus
it did the same checks more than once for no good reason. I also
believe it's broken to have "len" as an int, should be unsigned long
even if it's in page_size units (64bit would overflow with huge user
address space). This is my current status. Comments? (still I don't see
any value in returning any error from make_pages_present, nobody is
checking that anyways, in my tree the only error that could really
happen in find_vma was turned down for some unknown reason by some
random patch, I mean, mlock.c already does all the checks and returns
-ENOMEM if the mlock range is outside vmas, make_pages_present should
just do its job, we're under the mmap sem so nothing can change to
prevent make_pages_present to do its job)

Signed-off-by: Andrea Arcangeli <[email protected]>

Index: linux-2.5/arch/i386/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/i386/mm/hugetlbpage.c,v
retrieving revision 1.54
diff -u -p -r1.54 hugetlbpage.c
--- linux-2.5/arch/i386/mm/hugetlbpage.c 26 Oct 2004 01:24:02 -0000 1.54
+++ linux-2.5/arch/i386/mm/hugetlbpage.c 17 Nov 2004 16:35:31 -0000
@@ -94,10 +94,10 @@ nomem:
int
follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, unsigned long *length, int i)
{
unsigned long vpfn, vaddr = *position;
- int remainder = *length;
+ unsigned long remainder = *length;

WARN_ON(!is_vm_hugetlb_page(vma));

Index: linux-2.5/arch/ia64/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ia64/mm/hugetlbpage.c,v
retrieving revision 1.31
diff -u -p -r1.31 hugetlbpage.c
--- linux-2.5/arch/ia64/mm/hugetlbpage.c 30 Jun 2004 02:26:38 -0000 1.31
+++ linux-2.5/arch/ia64/mm/hugetlbpage.c 17 Nov 2004 16:35:23 -0000
@@ -119,12 +119,12 @@ nomem:
int
follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *st, int *length, int i)
+ unsigned long *st, unsigned long *length, int i)
{
pte_t *ptep, pte;
unsigned long start = *st;
unsigned long pstart;
- int len = *length;
+ unsigned long len = *length;
struct page *page;

do {
Index: linux-2.5/arch/ppc64/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ppc64/mm/hugetlbpage.c,v
retrieving revision 1.35
diff -u -p -r1.35 hugetlbpage.c
--- linux-2.5/arch/ppc64/mm/hugetlbpage.c 28 Oct 2004 15:19:52 -0000 1.35
+++ linux-2.5/arch/ppc64/mm/hugetlbpage.c 17 Nov 2004 16:36:47 -0000
@@ -325,10 +325,10 @@ int copy_hugetlb_page_range(struct mm_st
int
follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, unsigned long *length, int i)
{
unsigned long vpfn, vaddr = *position;
- int remainder = *length;
+ unsigned long remainder = *length;

WARN_ON(!is_vm_hugetlb_page(vma));

Index: linux-2.5/arch/sh/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sh/mm/hugetlbpage.c,v
retrieving revision 1.10
diff -u -p -r1.10 hugetlbpage.c
--- linux-2.5/arch/sh/mm/hugetlbpage.c 10 May 2004 21:08:49 -0000 1.10
+++ linux-2.5/arch/sh/mm/hugetlbpage.c 17 Nov 2004 16:34:57 -0000
@@ -126,10 +126,10 @@ nomem:

int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, unsigned long *length, int i)
{
unsigned long vaddr = *position;
- int remainder = *length;
+ unsigned long remainder = *length;

WARN_ON(!is_vm_hugetlb_page(vma));

Index: linux-2.5/arch/sh64/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sh64/mm/hugetlbpage.c,v
retrieving revision 1.2
diff -u -p -r1.2 hugetlbpage.c
--- linux-2.5/arch/sh64/mm/hugetlbpage.c 30 Jun 2004 02:18:58 -0000 1.2
+++ linux-2.5/arch/sh64/mm/hugetlbpage.c 17 Nov 2004 16:34:41 -0000
@@ -126,10 +126,10 @@ nomem:

int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, unsigned long *length, int i)
{
unsigned long vaddr = *position;
- int remainder = *length;
+ unsigned long remainder = *length;

WARN_ON(!is_vm_hugetlb_page(vma));

Index: linux-2.5/arch/sparc64/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sparc64/mm/hugetlbpage.c,v
retrieving revision 1.24
diff -u -p -r1.24 hugetlbpage.c
--- linux-2.5/arch/sparc64/mm/hugetlbpage.c 10 May 2004 21:08:49 -0000 1.24
+++ linux-2.5/arch/sparc64/mm/hugetlbpage.c 17 Nov 2004 16:34:36 -0000
@@ -123,10 +123,10 @@ nomem:

int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, unsigned long *length, int i)
{
unsigned long vaddr = *position;
- int remainder = *length;
+ unsigned long remainder = *length;

WARN_ON(!is_vm_hugetlb_page(vma));

Index: linux-2.5/include/linux/hugetlb.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/hugetlb.h,v
retrieving revision 1.33
diff -u -p -r1.33 hugetlb.h
--- linux-2.5/include/linux/hugetlb.h 8 Aug 2004 06:43:47 -0000 1.33
+++ linux-2.5/include/linux/hugetlb.h 17 Nov 2004 16:32:13 -0000
@@ -14,7 +14,7 @@ static inline int is_vm_hugetlb_page(str

int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, unsigned long *, int);
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
Index: linux-2.5/include/linux/mm.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/mm.h,v
retrieving revision 1.195
diff -u -p -r1.195 mm.h
--- linux-2.5/include/linux/mm.h 14 Nov 2004 04:35:49 -0000 1.195
+++ linux-2.5/include/linux/mm.h 17 Nov 2004 16:16:50 -0000
@@ -587,12 +587,12 @@ extern pte_t *FASTCALL(pte_alloc_map(str
extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
-extern int make_pages_present(unsigned long addr, unsigned long end);
+extern void make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
void install_arg_page(struct vm_area_struct *, struct page *, unsigned long);

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
- int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+ unsigned long nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas);

int __set_page_dirty_buffers(struct page *page);
int __set_page_dirty_nobuffers(struct page *page);
Index: linux-2.5/mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.191
diff -u -p -r1.191 memory.c
--- linux-2.5/mm/memory.c 16 Nov 2004 03:53:40 -0000 1.191
+++ linux-2.5/mm/memory.c 17 Nov 2004 16:25:26 -0000
@@ -713,7 +713,7 @@ untouched_anonymous_page(struct mm_struc


int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, unsigned long nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
int i;
@@ -757,17 +757,35 @@ int get_user_pages(struct task_struct *t
vmas[i] = gate_vma;
i++;
start += PAGE_SIZE;
- len--;
+ nr_pages--;
continue;
}

- if (!vma || (pages && (vma->vm_flags & VM_IO))
- || !(flags & vma->vm_flags))
+ if (!vma || !(flags & vma->vm_flags))
return i ? : -EFAULT;

+ if (unlikely(vma->vm_flags & VM_IO)) {
+ unsigned long this_page_walk;
+
+ if (pages)
+ return i ? : -EFAULT;
+
+ /*
+ * If 'pages' is NULL, we'll silenty skip over all VM_IO
+ * vmas without returning errors.
+ */
+ this_page_walk = (vma->vm_end - start) >> PAGE_SHIFT;
+ if (this_page_walk > nr_pages)
+ this_page_walk = nr_pages;
+ start = vma->vm_end;
+ nr_pages -= this_page_walk;
+ i += this_page_walk;
+ continue;
+ }
+
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
+ &start, &nr_pages, i);
continue;
}
spin_lock(&mm->page_table_lock);
@@ -829,10 +847,10 @@ int get_user_pages(struct task_struct *t
vmas[i] = vma;
i++;
start += PAGE_SIZE;
- len--;
- } while(len && start < vma->vm_end);
+ nr_pages--;
+ } while(nr_pages && start < vma->vm_end);
spin_unlock(&mm->page_table_lock);
- } while(len);
+ } while(nr_pages);
out:
return i;
}
@@ -1754,25 +1772,23 @@ out:
return pmd_offset(pgd, address);
}

-int make_pages_present(unsigned long addr, unsigned long end)
+void make_pages_present(unsigned long addr, unsigned long end)
{
- int ret, len, write;
+ unsigned long len;
+ int write;
struct vm_area_struct * vma;

vma = find_vma(current->mm, addr);
if (!vma)
- return -1;
+ return;
write = (vma->vm_flags & VM_WRITE) != 0;
if (addr >= end)
BUG();
if (end > vma->vm_end)
BUG();
len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
- ret = get_user_pages(current, current->mm, addr,
- len, write, 0, NULL, NULL);
- if (ret < 0)
- return ret;
- return ret == len ? 0 : -1;
+ get_user_pages(current, current->mm, addr,
+ len, write, 0, NULL, NULL);
}

/*
Index: linux-2.5/mm/mlock.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/mlock.c,v
retrieving revision 1.16
diff -u -p -r1.16 mlock.c
--- linux-2.5/mm/mlock.c 19 Oct 2004 05:54:02 -0000 1.16
+++ linux-2.5/mm/mlock.c 17 Nov 2004 13:42:26 -0000
@@ -47,7 +47,7 @@ static int mlock_fixup(struct vm_area_st
pages = (end - start) >> PAGE_SHIFT;
if (newflags & VM_LOCKED) {
pages = -pages;
- ret = make_pages_present(start, end);
+ make_pages_present(start, end);
}

vma->vm_mm->locked_vm -= pages;

2004-11-17 19:42:08

by Hugh Dickins

[permalink] [raw]
Subject: Re: loops in get_user_pages() for VM_IO

On Wed, 17 Nov 2004, Andrea Arcangeli wrote:
> On Wed, Nov 17, 2004 at 02:44:08PM +0100, Andrea Arcangeli wrote:
> > this would be the full port against mainline:
>
> while it's not necessary with mlockall (since we walk each vma in
> mlock.c before calling make_pages_present), I've to agree having a more
> generic implementation of get_user_pages that doesn't stop on VM_IO with
> pages = NULL, sounds a nicer API (even if nobody uses that
> functionality). Still I'd leave make_pages_present void. Andrew's patch
> had two bugs: it didn't check for "len" underflow and it used vm_start
> instead of start to calculated the current page walk under VM_IO, plus
> it did the same checks more than once for no good reason. I also
> believe it's broken to have "len" as an int, should be unsigned long
> even if it's in page_size units (64bit would overflow with huge user
> address space). This is my current status. Comments? (still I don't see
> any value in returning any error from make_pages_present, nobody is
> checking that anyways, in my tree the only error that could really
> happen in find_vma was turned down for some unknown reason by some
> random patch, I mean, mlock.c already does all the checks and returns
> -ENOMEM if the mlock range is outside vmas, make_pages_present should
> just do its job, we're under the mmap sem so nothing can change to
> prevent make_pages_present to do its job)

I pretty much agree with everything you've said here, but...

It's three patches: void make_pages_present(), stepping over VM_IO
in get_user_pages(), and unsigned long nr_pages to get_user_pages.

I love the void make_pages_present(): at present it returns 0,
error code or -1 - was that really supposed to mean -EPERM?
What happens if someone tries to make_pages_present() more than
is physically available? I think get_user_pages() returns -ENOMEM,
the intervening levels ignore that, the process then gets OOM-killed.
If that's acceptable, then I'd think just go with the simple patch,
void make_pages_present(), for 2.6.10.

Stepping over VM_IO in get_user_pages(): yes, I agreed with Andrew's
take on that (mlockall must not fail just because there's a VM_IO in
the address space), and I agree with your fixes to his draft. But
if we accept your void make_pages_present(), then we're left with
no uses for that code right now. Better leave out the additonal
complexity until it's needed? make_pages_present() already has a
BUG for end > vma->vm_end.

The unsigned long nr_pages to get_user_pages(). That looks incomplete
to me: isn't there an mm/nommu.c version which needs patching too?
and I think you'd better make it a long rather than an unsigned long,
because you also need to change the return value from int to long
(though most callers are safe to stick with assigning it to an int).
Is this a change that 2.6.10 would really need?

I've copied Terence partly to allow me to apologize: it wasn't until
DaveM's mail that I realized that most of these problems suddenly
arose from my copying vma->vm_flags back to vm_flags at the end of
do_mmap_pgoff - I was trying to avoid vm_stat inconsistencies and
vma_merge mismerges, was quite unconscious of this side-effect at
the time - sorry. But Andrea's mlock and mlockall considerations
show that it was a latent problem even without my change there.

It is a good idea not to put VM_LOCKED on the VM_IO regions, since
their pages get counted out of locked_vm (without a prior check).
We could fix that by carefully checking for VM_IO everywhere we
manipulate VM_LOCKED, but tiresome, let's not bother until someone
complains of a real case where VM_IO is really using up too much
locked_vm. We certainly shouldn't hang on VM_LOCKED|VM_IO areas.

(I do wonder whether the VM_IO added in remap_pfn_range will stop
someone from doing a get_user_pages on an area they could do it on
before - the rvmalloc'ed cases, for example, don't really need VM_IO
protection. But that's a different issue, no complaints heard yet.)

Hugh

2004-11-18 20:00:18

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] Re: loops in get_user_pages() for VM_IO

On Wed, 17 Nov 2004, Hugh Dickins wrote:
>
> I love the void make_pages_present(): at present it returns 0,
> error code or -1 - was that really supposed to mean -EPERM?
> What happens if someone tries to make_pages_present() more than
> is physically available? I think get_user_pages() returns -ENOMEM,
> the intervening levels ignore that, the process then gets OOM-killed.
> If that's acceptable, then I'd think just go with the simple patch,
> void make_pages_present(), for 2.6.10.

Well, no. wli specifically added mlock's check for get_user_pages
error in 2.6.0-test6; and I'd be reluctant to hide the -EFAULT from
trying to lock down pages beyond end of file. So here's the fairly
minimal patch I now suggest for 2.6.10....


Whereas get_user_pages simply fails (or stops early) on a VM_IO area
when filling a page vector, it currently proceeds to the follow_page/
handle_mm_fault loop when not passed a vector (by make_pages_present):
which may loop forever with follow_page failing on out-of-range pfn
and handle_mm_fault succeeding on pte already present.

This would already have been a problem for mlock, but I've made it worse
in 2.4.10-rc by updating vm_flags from driver's vma->vm_flags at the end
of do_mmap_pgoff (to avoid possible vma mismerge): if the driver sets
VM_LOCKED|VM_IO (not a recommended combination, but still done), then
make_pages_present is now called and a hang results.

We've removed VM_LOCKED from some of the drivers in question. But it's
surely unacceptable for an mlockall to hang or fail just because there
happens to be a VM_IO area somewhere in the address space.

We can argue whether VM_IO areas should be counted in locked_vm, but
go with a minimal fix for now: get_user_pages fail or stop on VM_IO
whether or not it's filling a page vector; but mlock_fixup (the only
place which checks for error from make_pages_present) skip VM_IO.

Signed-off-by: Hugh Dickins <[email protected]>

--- 2.6.10-bk3/mm/memory.c 2004-11-17 14:47:01.000000000 +0000
+++ linux/mm/memory.c 2004-11-18 18:00:04.204306016 +0000
@@ -761,7 +761,7 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (pages && (vma->vm_flags & VM_IO))
+ if (!vma || (vma->vm_flags & VM_IO)
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

--- 2.6.10-bk3/mm/mlock.c 2004-11-15 16:21:24.000000000 +0000
+++ linux/mm/mlock.c 2004-11-18 18:12:03.688927728 +0000
@@ -47,7 +47,8 @@ static int mlock_fixup(struct vm_area_st
pages = (end - start) >> PAGE_SHIFT;
if (newflags & VM_LOCKED) {
pages = -pages;
- ret = make_pages_present(start, end);
+ if (!(newflags & VM_IO))
+ ret = make_pages_present(start, end);
}

vma->vm_mm->locked_vm -= pages;