2009-06-24 14:38:24

by Thomas Gleixner

Subject: [PATCH V2] futex: Fix the write access fault problem for real

commit 64d1304a64 (futex: setup writeable mapping for futex ops which
modify user space data) addressed only half of the write access fault
problem.

The patch was based on two wrong assumptions:

1) access_ok(VERIFY_WRITE,...) would actually check write access.

On x86 it does _NOT_. It's a pure address range check.

2) an RW-mapped region cannot go away under us.

That's wrong as well. Nobody can prevent another thread from calling
mprotect(PROT_READ) on the region where the futex resides. If that
call hits between the get_user_pages_fast() verification and the
actual write access in the atomic region, we are toast again.

The solution is not to rely on access_ok() and get_user() for any
write-access related fault on private and shared futexes. Instead we
need to fault the page in with write access verified.

There is no generic non-destructive write mechanism which would fault
the user page in through a #PF, but as we already know that we will
fault we might as well call get_user_pages() directly and avoid the
#PF overhead.

If get_user_pages() returns -EFAULT we know that we cannot fix it and
need to bail out to user space.

Remove a bunch of confusing comments on this issue as well.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
---
kernel/futex.c | 45 ++++++++++++++++++++++++---------------------
1 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80b5ce7..1c33711 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -284,6 +284,25 @@ void put_futex_key(int fshared, union futex_key *key)
drop_futex_key_refs(key);
}

+/*
+ * fault_in_user_writeable - fault in user address and verify RW access
+ * @uaddr: pointer to faulting user space address
+ *
+ * Slow path to fixup the fault we just took in the atomic write
+ * access to @uaddr.
+ *
+ * We have no generic implementation of a non destructive write to the
+ * user address. We know that we faulted in the atomic pagefault
+ * disabled section so we can as well avoid the #PF overhead by
+ * calling get_user_pages() right away.
+ */
+static int fault_in_user_writeable(u32 __user *uaddr)
+{
+ int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
+ sizeof(*uaddr), 1, 0, NULL, NULL);
+ return ret < 0 ? ret : 0;
+}
+
/**
* futex_top_waiter() - Return the highest priority waiter on a futex
* @hb: the hash bucket the futex_q's reside in
@@ -896,7 +915,6 @@ retry:
retry_private:
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
- u32 dummy;

double_unlock_hb(hb1, hb2);

@@ -914,7 +932,7 @@ retry_private:
goto out_put_keys;
}

- ret = get_user(dummy, uaddr2);
+ ret = fault_in_user_writeable(uaddr2);
if (ret)
goto out_put_keys;

@@ -1204,7 +1222,7 @@ retry_private:
double_unlock_hb(hb1, hb2);
put_futex_key(fshared, &key2);
put_futex_key(fshared, &key1);
- ret = get_user(curval2, uaddr2);
+ ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
goto out;
@@ -1482,7 +1500,7 @@ retry:
handle_fault:
spin_unlock(q->lock_ptr);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);

spin_lock(q->lock_ptr);

@@ -1807,7 +1825,6 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
{
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
- u32 uval;
struct futex_q q;
int res, ret;

@@ -1909,16 +1926,9 @@ out:
return ret != -EINTR ? ret : -ERESTARTNOINTR;

uaddr_faulted:
- /*
- * We have to r/w *(int __user *)uaddr, and we have to modify it
- * atomically. Therefore, if we continue to fault after get_user()
- * below, we need to handle the fault ourselves, while still holding
- * the mmap_sem. This can occur if the uaddr is under contention as
- * we have to drop the mmap_sem in order to call get_user().
- */
queue_unlock(&q, hb);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);
if (ret)
goto out_put_key;

@@ -2013,17 +2023,10 @@ out:
return ret;

pi_faulted:
- /*
- * We have to r/w *(int __user *)uaddr, and we have to modify it
- * atomically. Therefore, if we continue to fault after get_user()
- * below, we need to handle the fault ourselves, while still holding
- * the mmap_sem. This can occur if the uaddr is under contention as
- * we have to drop the mmap_sem in order to call get_user().
- */
spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);
if (!ret)
goto retry;

--
1.6.0.6


2009-06-24 16:48:07

by Linus Torvalds

Subject: Re: [PATCH V2] futex: Fix the write access fault problem for real



On Wed, 24 Jun 2009, Thomas Gleixner wrote:
>
> The solution is not to rely on access_ok() and get_user() for any
> write-access related fault on private and shared futexes. Instead we
> need to fault the page in with write access verified.
>
> There is no generic non-destructive write mechanism which would fault
> the user page in through a #PF, but as we already know that we will
> fault we might as well call get_user_pages() directly and avoid the
> #PF overhead.

Ack. Patch looks saner this way, and the commit message explains the
issues.

Linus

2009-06-24 19:48:43

by Thomas Gleixner

Subject: [GIT pull] futex fixes for 2.6.31

Linus,

Please pull the latest futexes-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git futexes-for-linus

Thanks,

tglx

------------------>
Thomas Gleixner (1):
futex: Fix the write access fault problem for real


kernel/futex.c | 45 ++++++++++++++++++++++++---------------------
1 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80b5ce7..1c33711 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -284,6 +284,25 @@ void put_futex_key(int fshared, union futex_key *key)
drop_futex_key_refs(key);
}

+/*
+ * fault_in_user_writeable - fault in user address and verify RW access
+ * @uaddr: pointer to faulting user space address
+ *
+ * Slow path to fixup the fault we just took in the atomic write
+ * access to @uaddr.
+ *
+ * We have no generic implementation of a non destructive write to the
+ * user address. We know that we faulted in the atomic pagefault
+ * disabled section so we can as well avoid the #PF overhead by
+ * calling get_user_pages() right away.
+ */
+static int fault_in_user_writeable(u32 __user *uaddr)
+{
+ int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
+ sizeof(*uaddr), 1, 0, NULL, NULL);
+ return ret < 0 ? ret : 0;
+}
+
/**
* futex_top_waiter() - Return the highest priority waiter on a futex
* @hb: the hash bucket the futex_q's reside in
@@ -896,7 +915,6 @@ retry:
retry_private:
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
- u32 dummy;

double_unlock_hb(hb1, hb2);

@@ -914,7 +932,7 @@ retry_private:
goto out_put_keys;
}

- ret = get_user(dummy, uaddr2);
+ ret = fault_in_user_writeable(uaddr2);
if (ret)
goto out_put_keys;

@@ -1204,7 +1222,7 @@ retry_private:
double_unlock_hb(hb1, hb2);
put_futex_key(fshared, &key2);
put_futex_key(fshared, &key1);
- ret = get_user(curval2, uaddr2);
+ ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
goto out;
@@ -1482,7 +1500,7 @@ retry:
handle_fault:
spin_unlock(q->lock_ptr);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);

spin_lock(q->lock_ptr);

@@ -1807,7 +1825,6 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
{
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
- u32 uval;
struct futex_q q;
int res, ret;

@@ -1909,16 +1926,9 @@ out:
return ret != -EINTR ? ret : -ERESTARTNOINTR;

uaddr_faulted:
- /*
- * We have to r/w *(int __user *)uaddr, and we have to modify it
- * atomically. Therefore, if we continue to fault after get_user()
- * below, we need to handle the fault ourselves, while still holding
- * the mmap_sem. This can occur if the uaddr is under contention as
- * we have to drop the mmap_sem in order to call get_user().
- */
queue_unlock(&q, hb);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);
if (ret)
goto out_put_key;

@@ -2013,17 +2023,10 @@ out:
return ret;

pi_faulted:
- /*
- * We have to r/w *(int __user *)uaddr, and we have to modify it
- * atomically. Therefore, if we continue to fault after get_user()
- * below, we need to handle the fault ourselves, while still holding
- * the mmap_sem. This can occur if the uaddr is under contention as
- * we have to drop the mmap_sem in order to call get_user().
- */
spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

- ret = get_user(uval, uaddr);
+ ret = fault_in_user_writeable(uaddr);
if (!ret)
goto retry;

2009-06-25 02:43:26

by Yanmin Zhang

Subject: Re: [GIT pull] futex fixes for 2.6.31

On Wed, 2009-06-24 at 21:48 +0200, Thomas Gleixner wrote:
> Linus,
>
> Please pull the latest futexes-for-linus git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git futexes-for-linus
>
> Thanks,
>
> tglx
>
> ------------------>
> Thomas Gleixner (1):
> futex: Fix the write access fault problem for real
>
>
> kernel/futex.c | 45 ++++++++++++++++++++++++---------------------
> 1 files changed, 24 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 80b5ce7..1c33711 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -284,6 +284,25 @@ void put_futex_key(int fshared, union futex_key *key)
> drop_futex_key_refs(key);
> }
>
> +/*
> + * fault_in_user_writeable - fault in user address and verify RW access
> + * @uaddr: pointer to faulting user space address
> + *
> + * Slow path to fixup the fault we just took in the atomic write
> + * access to @uaddr.
> + *
> + * We have no generic implementation of a non destructive write to the
> + * user address. We know that we faulted in the atomic pagefault
> + * disabled section so we can as well avoid the #PF overhead by
> + * calling get_user_pages() right away.
> + */
> +static int fault_in_user_writeable(u32 __user *uaddr)
> +{
> + int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
> + sizeof(*uaddr), 1, 0, NULL, NULL);
The 4th parameter of get_user_pages() is the number of pages. sizeof(*uaddr)
is equal to 4, so you want 4 pages?

Yanmin

2009-06-25 09:57:19

by Thomas Gleixner

Subject: Re: [GIT pull] futex fixes for 2.6.31

On Thu, 25 Jun 2009, Zhang, Yanmin wrote:
> On Wed, 2009-06-24 at 21:48 +0200, Thomas Gleixner wrote:
> > +static int fault_in_user_writeable(u32 __user *uaddr)
> > +{
> > + int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
> > + sizeof(*uaddr), 1, 0, NULL, NULL);
> The 4th parameter of get_user_pages() is the number of pages. sizeof(*uaddr)
> is equal to 4, so you want 4 pages?

Grrr. I looked up the prototype of it in include/linux/mm.h:

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas);

len == number of pages! How intuitive. :(

I guess Linus tripped over it as well. See:
http://lkml.org/lkml/2009/6/21/95

> > > In fact, since you're not actually interested in the page, you _could_
> > > just do
> > >
> > > get_user_pages(tsk, mm, uaddr, 4, 1, 0, NULL, NULL);

Will fix, thanks for noticing!

tglx

2009-06-25 09:59:14

by Peter Zijlstra

Subject: [PATCH] clarify get_user_pages() prototype

On Thu, 2009-06-25 at 10:43 +0800, Zhang, Yanmin wrote:

> > +static int fault_in_user_writeable(u32 __user *uaddr)
> > +{
> > + int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
> > + sizeof(*uaddr), 1, 0, NULL, NULL);
> The 4th parameter of get_user_pages() is the number of pages. sizeof(*uaddr)
> is equal to 4, so you want 4 pages?

crap... that's what you get for only looking at the prototype :/

---
Subject: clarify get_user_pages() prototype

Currently the 4th parameter of get_user_pages() is called len, but it's
in pages, not bytes. Rename the thing to nr_pages to avoid future
confusion.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/mm.h | 2 +-
mm/memory.c | 26 ++++++++++++--------------
mm/nommu.c | 12 +++++-------
3 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d006e93..ba3a7cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -826,7 +826,7 @@ extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
diff --git a/mm/memory.c b/mm/memory.c
index f46ac18..6521619 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1207,8 +1207,8 @@ static inline int use_zero_page(struct vm_area_struct *vma)


int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int flags,
- struct page **pages, struct vm_area_struct **vmas)
+ unsigned long start, int nr_pages, int flags,
+ struct page **pages, struct vm_area_struct **vmas)
{
int i;
unsigned int vm_flags = 0;
@@ -1217,7 +1217,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);

- if (len <= 0)
+ if (nr_pages <= 0)
return 0;
/*
* Require read or write permissions.
@@ -1269,7 +1269,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vmas[i] = gate_vma;
i++;
start += PAGE_SIZE;
- len--;
+ nr_pages--;
continue;
}

@@ -1280,7 +1280,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,

if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i, write);
+ &start, &nr_pages, i, write);
continue;
}

@@ -1357,9 +1357,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vmas[i] = vma;
i++;
start += PAGE_SIZE;
- len--;
- } while (len && start < vma->vm_end);
- } while (len);
+ nr_pages--;
+ } while (nr_pages && start < vma->vm_end);
+ } while (nr_pages);
return i;
}

@@ -1368,7 +1368,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
* @tsk: task_struct of target task
* @mm: mm_struct of target mm
* @start: starting user address
- * @len: number of pages from start to pin
+ * @nr_pages: number of pages from start to pin
* @write: whether pages will be written to by the caller
* @force: whether to force write access even if user mapping is
* readonly. This will result in the page being COWed even
@@ -1380,7 +1380,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
* Or NULL if the caller does not require them.
*
* Returns number of pages pinned. This may be fewer than the number
- * requested. If len is 0 or negative, returns 0. If no pages
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno. Each page returned must be released
* with a put_page() call when it is finished with. vmas will only
* remain valid while mmap_sem is held.
@@ -1414,7 +1414,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
* See also get_user_pages_fast, for performance critical applications.
*/
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
int flags = 0;
@@ -1424,9 +1424,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (force)
flags |= GUP_FLAGS_FORCE;

- return __get_user_pages(tsk, mm,
- start, len, flags,
- pages, vmas);
+ return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}

EXPORT_SYMBOL(get_user_pages);
diff --git a/mm/nommu.c b/mm/nommu.c
index 2fd2ad5..bf0cc76 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -173,8 +173,8 @@ unsigned int kobjsize(const void *objp)
}

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int flags,
- struct page **pages, struct vm_area_struct **vmas)
+ unsigned long start, int nr_pages, int flags,
+ struct page **pages, struct vm_area_struct **vmas)
{
struct vm_area_struct *vma;
unsigned long vm_flags;
@@ -189,7 +189,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vm_flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);

- for (i = 0; i < len; i++) {
+ for (i = 0; i < nr_pages; i++) {
vma = find_vma(mm, start);
if (!vma)
goto finish_or_fault;
@@ -224,7 +224,7 @@ finish_or_fault:
* - don't permit access to VMAs that don't support it, such as I/O mappings
*/
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int write, int force,
+ unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
int flags = 0;
@@ -234,9 +234,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (force)
flags |= GUP_FLAGS_FORCE;

- return __get_user_pages(tsk, mm,
- start, len, flags,
- pages, vmas);
+ return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}
EXPORT_SYMBOL(get_user_pages);

2009-06-25 10:04:35

by Ingo Molnar

Subject: Re: [GIT pull] futex fixes for 2.6.31


* Thomas Gleixner <[email protected]> wrote:

> On Thu, 25 Jun 2009, Zhang, Yanmin wrote:
> > On Wed, 2009-06-24 at 21:48 +0200, Thomas Gleixner wrote:
> > > +static int fault_in_user_writeable(u32 __user *uaddr)
> > > +{
> > > + int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
> > > + sizeof(*uaddr), 1, 0, NULL, NULL);
> > The 4th parameter of get_user_pages() is the number of pages. sizeof(*uaddr)
> > is equal to 4, so you want 4 pages?
>
> Grrr. I looked up the prototype of it in include/linux/mm.h:
>
> int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int write, int force,
> struct page **pages, struct vm_area_struct **vmas);
>
> len == number of pages! How intuitive. :(
>
> I guess Linus tripped over it as well. See:
> http://lkml.org/lkml/2009/6/21/95

heh, and then i suggested to you to change the '4' to sizeof(u32)
;-)

Ingo

2009-06-25 18:15:50

by Thomas Gleixner

Subject: [GIT pull] final futex fixes for 2.6.31

Linus,

I'm truly embarrassed to ask you to pull the latest futexes-for-linus
git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git futexes-for-linus

I did not just copy your suggestion blindly. I even wondered, looked up
the prototype in mm.h, which has the 4th argument "int len", and said
to myself "hey, Linus is right as usual", while in reality "len" is the
number of pages.

As I'm running out of brown paper bags, I'm declaring all futex bugs
fixed by definition. Futex bugs which are discovered from now on are
going to be automatically converted to documented features.

Thanks,

tglx

------------------>
Thomas Gleixner (1):
futex: request only one page from get_user_pages()


kernel/futex.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1c33711..794c862 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -299,7 +299,7 @@ void put_futex_key(int fshared, union futex_key *key)
static int fault_in_user_writeable(u32 __user *uaddr)
{
int ret = get_user_pages(current, current->mm, (unsigned long)uaddr,
- sizeof(*uaddr), 1, 0, NULL, NULL);
+ 1, 1, 0, NULL, NULL);
return ret < 0 ? ret : 0;
}

2009-06-25 18:27:45

by Linus Torvalds

Subject: Re: [GIT pull] final futex fixes for 2.6.31



On Thu, 25 Jun 2009, Thomas Gleixner wrote:
>
> I'm truly embarrassed to ask you to pull the latest futexes-for-linus
> git tree from:

No problem. It was a harmless bug (at worst we'd do some extra work), and
I made the same mistake, so I can't blame you.

And I merged Peter's prototype clarification so that next time there will
be less reason for this to happen.

Linus