2011-06-06 14:28:40

by David C. Oliver

[permalink] [raw]
Subject: Change in functionality of futex() system call.

Hello,

The functionality of the futex() system call appears to have changed
between versions 2.6.18 and 2.6.32.28.

Specifically, performing a FUTEX_WAIT on a read-only mapped location
results in an EFAULT. Although other operations, such as FUTEX_WAKE,
are only meaningful for writable locations, FUTEX_WAIT is useful for
processes with read-only access to a memory-mapped file.

The code below illustrates the changed behavior (each of the EXPECT
operations succeed on the older kernel, the ASSERTs pass in each
case), assuming the file /tmp/futex_test exists and contains int(42).

With the older kernel, the syscall() suspends until another process
changes the file and issues a FUTEX_WAKE, whereas the new behavior is
for an EFAULT error, independent of the file contents.

Let me know if you need further clarification.

Cheers!

David Oliver.


#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
typedef uint32_t u32; ? // for futex.h
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include "gtest/gtest.h" // test framework to illustrate issue.


TEST(Futex, futex_in_read_only_file_is_ok) {
? int fd = open("/tmp/futex_test", O_RDONLY);
? ASSERT_GE(fd, 0);
? int* futex = static_cast<int *>(mmap(0, sizeof(int), PROT_READ,
MAP_SHARED, fd, 0));
? ASSERT_NE((int *)(0), futex);

? int rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);

? EXPECT_NE(-1, rc); ? ? ? ? ? ? ?// fails.
? if (rc == -1) {
? ? ? EXPECT_NE(errno, EFAULT); ? // fails.
? }
}


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.


2011-06-06 15:23:45

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 09:28 -0500, David Oliver a écrit :
> Hello,
>
> The functionality of the futex() system call appears to have changed
> between versions 2.6.18 and 2.6.32.28.
>
> Specifically, performing a FUTEX_WAIT on a read-only mapped location
> results in an EFAULT. Although other operations, such as FUTEX_WAKE,
> are only meaningful for writable locations, FUTEX_WAIT is useful for
> processes with read-only access to a memory-mapped file.
>
> The code below illustrates the changed behavior (each of the EXPECT
> operations succeed on the older kernel, the ASSERTs pass in each
> case), assuming the file /tmp/futex_test exists and contains int(42).
>
> With the older kernel, the syscall() suspends until another process
> changes the file and issues a FUTEX_WAKE, whereas the new behavior is
> for an EFAULT error, independent of the file contents.
>
> Let me know if you need further clarification.
>
> Cheers!
>
> David Oliver.
>
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> typedef uint32_t u32; // for futex.h
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> #include "gtest/gtest.h" // test framework to illustrate issue.
>
>
> TEST(Futex, futex_in_read_only_file_is_ok) {
> int fd = open("/tmp/futex_test", O_RDONLY);
> ASSERT_GE(fd, 0);
> int* futex = static_cast<int *>(mmap(0, sizeof(int), PROT_READ,
> MAP_SHARED, fd, 0));
> ASSERT_NE((int *)(0), futex);
>
> int rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
>
> EXPECT_NE(-1, rc); // fails.
> if (rc == -1) {
> EXPECT_NE(errno, EFAULT); // fails.
> }
> }
>

Right you are, this came from commit 7485d0d3758e8e6491a5 (futexes:
Remove rw parameter from get_futex_key()) in 2.6.33

commit 7485d0d3758e8e6491a5c9468114e74dc050785d
Author: KOSAKI Motohiro <[email protected]>
Date: Tue Jan 5 16:32:43 2010 +0900

futexes: Remove rw parameter from get_futex_key()

Currently, futexes have two problem:

A) The current futex code doesn't handle private file mappings properly.

get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:

1) thread-A call futex(private-mapping, FUTEX_WAIT), it
sleeps on file mapping object.
2) thread-B writes a variable and it makes it cow.
3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
wakes up blocked thread on the anonymous page. (but it's nothing)

B) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.

The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.

Performance concerns:

Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.

Compatibility concerns:

This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Darren Hart <[email protected]>
Cc: <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Ulrich Drepper <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


2011-06-06 15:56:55

by Shawn Bohrer

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, Jun 06, 2011 at 05:23:39PM +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 ? 09:28 -0500, David Oliver a ?crit :
> > Hello,
> >
> > The functionality of the futex() system call appears to have changed
> > between versions 2.6.18 and 2.6.32.28.
> >
> > Specifically, performing a FUTEX_WAIT on a read-only mapped location
> > results in an EFAULT. Although other operations, such as FUTEX_WAKE,
> > are only meaningful for writable locations, FUTEX_WAIT is useful for
> > processes with read-only access to a memory-mapped file.
> >
> > The code below illustrates the changed behavior (each of the EXPECT
> > operations succeed on the older kernel, the ASSERTs pass in each
> > case), assuming the file /tmp/futex_test exists and contains int(42).
> >
> > With the older kernel, the syscall() suspends until another process
> > changes the file and issues a FUTEX_WAKE, whereas the new behavior is
> > for an EFAULT error, independent of the file contents.
> >
> > Let me know if you need further clarification.
> >
> > Cheers!
> >
> > David Oliver.
> >
> >
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdint.h>
> > typedef uint32_t u32; // for futex.h
> > #include <linux/futex.h>
> > #include <sys/mman.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> > #include "gtest/gtest.h" // test framework to illustrate issue.
> >
> >
> > TEST(Futex, futex_in_read_only_file_is_ok) {
> > int fd = open("/tmp/futex_test", O_RDONLY);
> > ASSERT_GE(fd, 0);
> > int* futex = static_cast<int *>(mmap(0, sizeof(int), PROT_READ,
> > MAP_SHARED, fd, 0));
> > ASSERT_NE((int *)(0), futex);
> >
> > int rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
> >
> > EXPECT_NE(-1, rc); // fails.
> > if (rc == -1) {
> > EXPECT_NE(errno, EFAULT); // fails.
> > }
> > }
> >
>
> Right you are, this came from commit 7485d0d3758e8e6491a5 (futexes:
> Remove rw parameter from get_futex_key()) in 2.6.33
>
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d
> Author: KOSAKI Motohiro <[email protected]>
> Date: Tue Jan 5 16:32:43 2010 +0900
>
> futexes: Remove rw parameter from get_futex_key()
>
> Currently, futexes have two problem:
>
> A) The current futex code doesn't handle private file mappings properly.
>
> get_futex_key() uses PageAnon() to distinguish file and
> anon, which can cause the following bad scenario:
>
> 1) thread-A call futex(private-mapping, FUTEX_WAIT), it
> sleeps on file mapping object.
> 2) thread-B writes a variable and it makes it cow.
> 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
> wakes up blocked thread on the anonymous page. (but it's nothing)
>
> B) Current futex code doesn't handle zero page properly.
>
> Read mode get_user_pages() can return zero page, but current
> futex code doesn't handle it at all. Then, zero page makes
> infinite loop internally.
>
> The solution is to use write mode get_user_page() always for
> page lookup. It prevents the lookup of both file page of private
> mappings and zero page.
>
> Performance concerns:
>
> Probaly very little, because glibc always initialize variables
> for futex before to call futex(). It means glibc users never see
> the overhead of this patch.
>
> Compatibility concerns:
>
> This patch has few compatibility issues. After this patch,
> FUTEX_WAIT require writable access to futex variables (read-only
> mappings makes EFAULT). But practically it's not a problem,
> glibc always initalizes variables for futexes explicitly - nobody
> uses read-only mappings.

We use read-only mappings. In our case we have Process A writing to
a memory-mapped file, and Processes B, C, D, etc reading that memory
mapped file. Using a futex in the file header is a convenient way to
notify the readers of updates.

glibc is not the only user of futexes.

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-06 16:11:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 17:23 +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 09:28 -0500, David Oliver a écrit :
> > Hello,
> >
> > The functionality of the futex() system call appears to have changed
> > between versions 2.6.18 and 2.6.32.28.
> >
> > Specifically, performing a FUTEX_WAIT on a read-only mapped location
> > results in an EFAULT. Although other operations, such as FUTEX_WAKE,
> > are only meaningful for writable locations, FUTEX_WAIT is useful for
> > processes with read-only access to a memory-mapped file.
> >
> > The code below illustrates the changed behavior (each of the EXPECT
> > operations succeed on the older kernel, the ASSERTs pass in each
> > case), assuming the file /tmp/futex_test exists and contains int(42).
> >
> > With the older kernel, the syscall() suspends until another process
> > changes the file and issues a FUTEX_WAKE, whereas the new behavior is
> > for an EFAULT error, independent of the file contents.
> >
> > Let me know if you need further clarification.
> >
> > Cheers!
> >
> > David Oliver.
> >
> >
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdint.h>
> > typedef uint32_t u32; // for futex.h
> > #include <linux/futex.h>
> > #include <sys/mman.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> > #include "gtest/gtest.h" // test framework to illustrate issue.
> >
> >
> > TEST(Futex, futex_in_read_only_file_is_ok) {
> > int fd = open("/tmp/futex_test", O_RDONLY);
> > ASSERT_GE(fd, 0);
> > int* futex = static_cast<int *>(mmap(0, sizeof(int), PROT_READ,
> > MAP_SHARED, fd, 0));
> > ASSERT_NE((int *)(0), futex);
> >
> > int rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
> >
> > EXPECT_NE(-1, rc); // fails.
> > if (rc == -1) {
> > EXPECT_NE(errno, EFAULT); // fails.
> > }
> > }
> >
>
> Right you are, this came from commit 7485d0d3758e8e6491a5 (futexes:
> Remove rw parameter from get_futex_key()) in 2.6.33
>
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d
> Author: KOSAKI Motohiro <[email protected]>
> Date: Tue Jan 5 16:32:43 2010 +0900
>
> futexes: Remove rw parameter from get_futex_key()
>
> Currently, futexes have two problem:
>
> A) The current futex code doesn't handle private file mappings properly.
>
> get_futex_key() uses PageAnon() to distinguish file and
> anon, which can cause the following bad scenario:
>
> 1) thread-A call futex(private-mapping, FUTEX_WAIT), it
> sleeps on file mapping object.
> 2) thread-B writes a variable and it makes it cow.
> 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
> wakes up blocked thread on the anonymous page. (but it's nothing)
>
> B) Current futex code doesn't handle zero page properly.
>
> Read mode get_user_pages() can return zero page, but current
> futex code doesn't handle it at all. Then, zero page makes
> infinite loop internally.
>
> The solution is to use write mode get_user_page() always for
> page lookup. It prevents the lookup of both file page of private
> mappings and zero page.
>
> Performance concerns:
>
> Probaly very little, because glibc always initialize variables
> for futex before to call futex(). It means glibc users never see
> the overhead of this patch.
>
> Compatibility concerns:
>
> This patch has few compatibility issues. After this patch,
> FUTEX_WAIT require writable access to futex variables (read-only
> mappings makes EFAULT). But practically it's not a problem,
> glibc always initalizes variables for futexes explicitly - nobody
> uses read-only mappings.

Urgh,. maybe something like the below but with more conditionals that
enable the extra logic only for FUTEX_WAIT..

The idea is to try a RO gup() when the RW gup() fails so as not to slow
down the common path of writable anonymous maps and bail when we used
the RO path on anonymous memory.

---
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..11f2ad1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ if (err == -EFAULT) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;

@@ -316,6 +320,11 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +336,10 @@ again:

get_futex_key_refs(key);

+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)

2011-06-06 16:16:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 18:11 +0200, Peter Zijlstra wrote:
> On Mon, 2011-06-06 at 17:23 +0200, Eric Dumazet wrote:
> > Le lundi 06 juin 2011 à 09:28 -0500, David Oliver a écrit :
> > > Hello,
> > >
> > > The functionality of the futex() system call appears to have changed
> > > between versions 2.6.18 and 2.6.32.28.
> > >
> > > Specifically, performing a FUTEX_WAIT on a read-only mapped location
> > > results in an EFAULT. Although other operations, such as FUTEX_WAKE,
> > > are only meaningful for writable locations, FUTEX_WAIT is useful for
> > > processes with read-only access to a memory-mapped file.
> > >
> > > The code below illustrates the changed behavior (each of the EXPECT
> > > operations succeed on the older kernel, the ASSERTs pass in each
> > > case), assuming the file /tmp/futex_test exists and contains int(42).
> > >
> > > With the older kernel, the syscall() suspends until another process
> > > changes the file and issues a FUTEX_WAKE, whereas the new behavior is
> > > for an EFAULT error, independent of the file contents.
> > >
> > > Let me know if you need further clarification.
> > >
> > > Cheers!
> > >
> > > David Oliver.
> > >
> > >
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <stdint.h>
> > > typedef uint32_t u32; // for futex.h
> > > #include <linux/futex.h>
> > > #include <sys/mman.h>
> > > #include <sys/syscall.h>
> > > #include <unistd.h>
> > > #include "gtest/gtest.h" // test framework to illustrate issue.
> > >
> > >
> > > TEST(Futex, futex_in_read_only_file_is_ok) {
> > > int fd = open("/tmp/futex_test", O_RDONLY);
> > > ASSERT_GE(fd, 0);
> > > int* futex = static_cast<int *>(mmap(0, sizeof(int), PROT_READ,
> > > MAP_SHARED, fd, 0));
> > > ASSERT_NE((int *)(0), futex);
> > >
> > > int rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
> > >
> > > EXPECT_NE(-1, rc); // fails.
> > > if (rc == -1) {
> > > EXPECT_NE(errno, EFAULT); // fails.
> > > }
> > > }
> > >
> >
> > Right you are, this came from commit 7485d0d3758e8e6491a5 (futexes:
> > Remove rw parameter from get_futex_key()) in 2.6.33
> >
> > commit 7485d0d3758e8e6491a5c9468114e74dc050785d
> > Author: KOSAKI Motohiro <[email protected]>
> > Date: Tue Jan 5 16:32:43 2010 +0900
> >
> > futexes: Remove rw parameter from get_futex_key()
> >
> > Currently, futexes have two problem:
> >
> > A) The current futex code doesn't handle private file mappings properly.
> >
> > get_futex_key() uses PageAnon() to distinguish file and
> > anon, which can cause the following bad scenario:
> >
> > 1) thread-A call futex(private-mapping, FUTEX_WAIT), it
> > sleeps on file mapping object.
> > 2) thread-B writes a variable and it makes it cow.
> > 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
> > wakes up blocked thread on the anonymous page. (but it's nothing)
> >
> > B) Current futex code doesn't handle zero page properly.
> >
> > Read mode get_user_pages() can return zero page, but current
> > futex code doesn't handle it at all. Then, zero page makes
> > infinite loop internally.
> >
> > The solution is to use write mode get_user_page() always for
> > page lookup. It prevents the lookup of both file page of private
> > mappings and zero page.
> >
> > Performance concerns:
> >
> > Probaly very little, because glibc always initialize variables
> > for futex before to call futex(). It means glibc users never see
> > the overhead of this patch.
> >
> > Compatibility concerns:
> >
> > This patch has few compatibility issues. After this patch,
> > FUTEX_WAIT require writable access to futex variables (read-only
> > mappings makes EFAULT). But practically it's not a problem,
> > glibc always initalizes variables for futexes explicitly - nobody
> > uses read-only mappings.
>
> Urgh,. maybe something like the below but with more conditionals that
> enable the extra logic only for FUTEX_WAIT..
>
> The idea is to try a RO gup() when the RW gup() fails so as not to slow
> down the common path of writable anonymous maps and bail when we used
> the RO path on anonymous memory.
>
> ---
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..11f2ad1 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> unsigned long address = (unsigned long)uaddr;
> struct mm_struct *mm = current->mm;
> struct page *page, *page_head;
> - int err;
> + int err, ro = 0;
>
> /*
> * The futex address must be "naturally" aligned.
> @@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>
> again:
> err = get_user_pages_fast(address, 1, 1, &page);
> + if (err == -EFAULT) {
> + err = get_user_pages_fast(address, 1, 0, &page);
> + ro = 1;
> + }
> if (err < 0)
> return err;
>
> @@ -316,6 +320,11 @@ again:
> * the object not the particular process.
> */
> if (PageAnon(page_head)) {
> + if (ro) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> key->private.mm = mm;
> key->private.address = address;
> @@ -327,9 +336,10 @@ again:
>
> get_futex_key_refs(key);
>
> +out:
> unlock_page(page_head);
> put_page(page_head);
> - return 0;
> + return err;
> }
>
> static inline void put_futex_key(union futex_key *key)
>

Hmm, wouldn't that still be susceptible to the zero-page thing if: we
create a writable private file map of a sparse file, touch a page and
then remap the thing RO?


2011-06-06 16:22:34

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 18:16 +0200, Peter Zijlstra a écrit :

> Hmm, wouldn't that still be susceptible to the zero-page thing if: we
> create a writable private file map of a sparse file, touch a page and
> then remap the thing RO?
>
>
>

Also I am not sure how MAP_PRIVATE could be affected. If we still try a
RW gup()... It will allocate a page for us, instead of still pointing to
shared one.

On previous kernel, the application using read-only mapping could use
MAP_PRIVATE or MAP_SHARED with same 'behavior'

2011-06-06 16:29:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 18:22 +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 18:16 +0200, Peter Zijlstra a écrit :
>
> > Hmm, wouldn't that still be susceptible to the zero-page thing if: we
> > create a writable private file map of a sparse file, touch a page and
> > then remap the thing RO?
> >
> >
> >
>
> Also I am not sure how MAP_PRIVATE could be affected. If we still try a
> RW gup()... It will allocate a page for us, instead of still pointing to
> shared one.
>
> On previous kernel, the application using read-only mapping could use
> MAP_PRIVATE or MAP_SHARED with same 'behavior'

But by not forcing the COW you get different behaviour depending on when
you call FUTEX_WAIT, surely that's not correct either?

2011-06-06 16:42:51

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 18:29 +0200, Peter Zijlstra a écrit :
> On Mon, 2011-06-06 at 18:22 +0200, Eric Dumazet wrote:
> > Le lundi 06 juin 2011 à 18:16 +0200, Peter Zijlstra a écrit :
> >
> > > Hmm, wouldn't that still be susceptible to the zero-page thing if: we
> > > create a writable private file map of a sparse file, touch a page and
> > > then remap the thing RO?
> > >
> > >
> > >
> >
> > Also I am not sure how MAP_PRIVATE could be affected. If we still try a
> > RW gup()... It will allocate a page for us, instead of still pointing to
> > shared one.
> >
> > On previous kernel, the application using read-only mapping could use
> > MAP_PRIVATE or MAP_SHARED with same 'behavior'
>
> But by not forcing the COW you get different behaviour depending on when
> you call FUTEX_WAIT, surely that's not correct either?


As long as the current process never writes to the page holding the
futex, the page stay shared. Behavior should be same with PRIVATE or
SHARED ?

In David Oliver case, this is needed : He wants to catch a change in a
file/memory region written by another process.


2011-06-06 17:05:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 18:42 +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 18:29 +0200, Peter Zijlstra a écrit :
> > On Mon, 2011-06-06 at 18:22 +0200, Eric Dumazet wrote:
> > > Le lundi 06 juin 2011 à 18:16 +0200, Peter Zijlstra a écrit :
> > >
> > > > Hmm, wouldn't that still be susceptible to the zero-page thing if: we
> > > > create a writable private file map of a sparse file, touch a page and
> > > > then remap the thing RO?
> > > >
> > > >
> > > >
> > >
> > > Also I am not sure how MAP_PRIVATE could be affected. If we still try a
> > > RW gup()... It will allocate a page for us, instead of still pointing to
> > > shared one.
> > >
> > > On previous kernel, the application using read-only mapping could use
> > > MAP_PRIVATE or MAP_SHARED with same 'behavior'
> >
> > But by not forcing the COW you get different behaviour depending on when
> > you call FUTEX_WAIT, surely that's not correct either?
>
>
> As long as the current process never writes to the page holding the
> futex, the page stay shared. Behavior should be same with PRIVATE or
> SHARED ?

Dunno, using futexes on private file maps is stupid imo, its just asking
for trouble, ro private file maps are even worse. Forcing the COW is the
only sane answer in that it gives consistent results and 'breaks' silly
expectations early instead of sometimes.

Anyway, that's not really the issue here, as David uses MAP_SHARED (as
one should if one is interested in the shared value).

2011-06-06 17:11:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 19:05 +0200, Peter Zijlstra a écrit :

> Dunno, using futexes on private file maps is stupid imo, its just asking
> for trouble, ro private file maps are even worse. Forcing the COW is the
> only sane answer in that it gives consistent results and 'breaks' silly
> expectations early instead of sometimes.
>
> Anyway, that's not really the issue here, as David uses MAP_SHARED (as
> one should if one is interested in the shared value).

Sure, but maybe another guy is 'stupid' and uses MAP_PRIVATE on its
read-only mappings. With old kernels this was working, and we were not
doing the COW.

(Note : the other process writes to the file, using MAP_SHARED)


2011-06-06 17:27:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, Jun 06, 2011 at 07:11:37PM +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 ? 19:05 +0200, Peter Zijlstra a ?crit :
>
> > Dunno, using futexes on private file maps is stupid imo, its just asking
> > for trouble, ro private file maps are even worse. Forcing the COW is the
> > only sane answer in that it gives consistent results and 'breaks' silly
> > expectations early instead of sometimes.
> >
> > Anyway, that's not really the issue here, as David uses MAP_SHARED (as
> > one should if one is interested in the shared value).
>
> Sure, but maybe another guy is 'stupid' and uses MAP_PRIVATE on its
> read-only mappings. With old kernels this was working, and we were not
> doing the COW.

That sounds like a bug in both the kernel and userspace. I would expect
a MAP_PRIVATE not be seen by any other process regardless. That's the
definition of PRIVATE.

From: http://www.gnu.org/s/hello/manual/libc/Memory_002dmapped-I_002fO.html

MAP_PRIVATE
This specifies that writes to the region should never be written back
to the attached file. Instead, a copy is made for the process, and the
region will be swapped normally if memory runs low. No other process
will see the changes.

-- Steve

2011-06-06 17:53:41

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/06/2011 09:42 AM, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 18:29 +0200, Peter Zijlstra a écrit :
>> On Mon, 2011-06-06 at 18:22 +0200, Eric Dumazet wrote:
>>> Le lundi 06 juin 2011 à 18:16 +0200, Peter Zijlstra a écrit :
>>>
>>>> Hmm, wouldn't that still be susceptible to the zero-page thing if: we
>>>> create a writable private file map of a sparse file, touch a page and
>>>> then remap the thing RO?
>>>>
>>>>
>>>>
>>>
>>> Also I am not sure how MAP_PRIVATE could be affected. If we still try a
>>> RW gup()... It will allocate a page for us, instead of still pointing to
>>> shared one.
>>>
>>> On previous kernel, the application using read-only mapping could use
>>> MAP_PRIVATE or MAP_SHARED with same 'behavior'
>>
>> But by not forcing the COW you get different behaviour depending on when
>> you call FUTEX_WAIT, surely that's not correct either?
>
>
> As long as the current process never writes to the page holding the
> futex, the page stay shared. Behavior should be same with PRIVATE or
> SHARED ?

If I understand the problem correctly, RO private mapping really doesn't
make any sense and we should probably explicitly not support it, while
special casing the RO shared mapping in support of David's scenario.

>
> In David Oliver case, this is needed : He wants to catch a change in a
> file/memory region written by another process.

But with shared mapping and shared futexes. He just needs the ability to
FUTEX_WAIT on a RO mapping. Or is that what you were saying?

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-06 17:56:15

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/06/2011 10:27 AM, Steven Rostedt wrote:
> On Mon, Jun 06, 2011 at 07:11:37PM +0200, Eric Dumazet wrote:
>> Le lundi 06 juin 2011 ? 19:05 +0200, Peter Zijlstra a ?crit :
>>
>>> Dunno, using futexes on private file maps is stupid imo, its just asking
>>> for trouble, ro private file maps are even worse. Forcing the COW is the
>>> only sane answer in that it gives consistent results and 'breaks' silly
>>> expectations early instead of sometimes.
>>>
>>> Anyway, that's not really the issue here, as David uses MAP_SHARED (as
>>> one should if one is interested in the shared value).
>>
>> Sure, but maybe another guy is 'stupid' and uses MAP_PRIVATE on its
>> read-only mappings. With old kernels this was working, and we were not
>> doing the COW.
>
> That sounds like a bug in both the kernel and userspace. I would expect
> a MAP_PRIVATE not be seen by any other process regardless. That's the
> definition of PRIVATE.
>
> From: http://www.gnu.org/s/hello/manual/libc/Memory_002dmapped-I_002fO.html
>
> MAP_PRIVATE
> This specifies that writes to the region should never be written back
> to the attached file. Instead, a copy is made for the process, and the
> region will be swapped normally if memory runs low. No other process
> will see the changes.


This doesn't address what happens if changes to a MAP_SHARED mapping are
visible to a MAP_PRIVATE mapping, which is more the issue at hand I believe.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-06 18:11:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>

> If I understand the problem correctly, RO private mapping really doesn't
> make any sense and we should probably explicitly not support it, while
> special casing the RO shared mapping in support of David's scenario.
>

We supported them in 2.6.18 kernels, apparently. This might sounds
stupid but who knows ?

> >
> > In David Oliver case, this is needed : He wants to catch a change in a
> > file/memory region written by another process.
>
> But with shared mapping and shared futexes. He just needs the ability to
> FUTEX_WAIT on a RO mapping. Or is that what you were saying?
>

I am saying that in David Oliver case, he sure uses a MAP_SHARED ro
mapping.

Now, what if other software uses a MAP_PRIVATE ro mapping ?

It was working in previous kernels as well.

We can say its stupid, but IMHO its not.

In other words, this program should work, if process never touches
(writes) into first page.

This program on previous kernels gave :
rc=-1 errno=11
(allowing to wait for a value change and a futex_WAKE)

With new kernel :
rc=-1 errno=14 [ no sleep allowed ]

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
typedef uint32_t u32; // for futex.h
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>


int main() {
int fd, *futex, rc;

fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
write(fd, "\1\1\1\1", 4);
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
printf("rc=%d errno=%d\n", rc, errno);
}

2011-06-06 18:20:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 19:11 +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 19:05 +0200, Peter Zijlstra a écrit :
>
> > Dunno, using futexes on private file maps is stupid imo, its just asking
> > for trouble, ro private file maps are even worse. Forcing the COW is the
> > only sane answer in that it gives consistent results and 'breaks' silly
> > expectations early instead of sometimes.
> >
> > Anyway, that's not really the issue here, as David uses MAP_SHARED (as
> > one should if one is interested in the shared value).
>
> Sure, but maybe another guy is 'stupid' and uses MAP_PRIVATE on its
> read-only mappings. With old kernels this was working, and we were not
> doing the COW.
>
> (Note : the other process writes to the file, using MAP_SHARED)

That's really not the point, what do we do when the COW happens during
the FUTEX_WAIT? At that point the process vaddr changes mapping and we
cannot continue the wait on the old page, since that would expose
invisible information, nor can we switch to the new page since we queued
on the old page.

Therefore we have to force the COW and queue on the private copy, it
really is the only semi sane semantic.



2011-06-06 18:28:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 20:23 +0200, Peter Zijlstra a écrit :

>
> That's really not the point, what do we do when the COW happens during
> the FUTEX_WAIT? At that point the process vaddr changes mapping and we
> cannot continue the wait on the old page, since that would expose
> invisible information, nor can we switch to the new page since we queued
> on the old page.
>
> Therefore we have to force the COW and queue on the private copy, it
> really is the only semi sane semantic.

The point is we dont necessarly have to COW the page. If you attempt
this COW, you shoot on user that did not expect to have a COW.

Take this program : COW is not allowed, still this worked on 2.6.18 (it
waits until another process change the value in file and call
futex_wait())

Using PROT_READ | PROT_WRITE instead of PROT_READ was OK too.

(If we use PROT_READ | PROT_WRITE, then after your patch, program doesnt
work anymore since this process gets a private page after your hidden
COW : It'll wait forever)

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
typedef uint32_t u32; // for futex.h
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>


int main(int argc, char *argv[]) {
int fd, *futex, rc, val = 42;

fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
write(fd, &val, 4);
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
printf("rc=%d errno=%d\n", rc, errno);
}


2011-06-07 03:13:58

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/06/2011 11:11 AM, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>>
>
>> If I understand the problem correctly, RO private mapping really doesn't
>> make any sense and we should probably explicitly not support it, while
>> special casing the RO shared mapping in support of David's scenario.
>>
>
> We supported them in 2.6.18 kernels, apparently. This might sounds
> stupid but who knows ?


I guess this is actually the key point we need to agree on to provide a
solution. This particular case "worked" in 2.6.18 kernels, but that
doesn't necessarily mean it was supported, or even intentional.

It sounds to me that we agree that we should support RO shared mappings.
The question remains about whether we should introduce deliberate
support of RO private mappings, and if so, if the forced COW approach is
appropriate or not.

Does anyone with a longer history working with futexes than I have an
opinion on this? Is support for RO private mappings part of our futex
API, or was it an unintentional side effect of the futex simply being a
userspace address.

--
Darren


>
>>>
>>> In David Oliver case, this is needed : He wants to catch a change in a
>>> file/memory region written by another process.
>>
>> But with shared mapping and shared futexes. He just needs the ability to
>> FUTEX_WAIT on a RO mapping. Or is that what you were saying?
>>
>
> I am saying that in David Oliver case, he sure uses a MAP_SHARED ro
> mapping.
>
> Now, what if other software uses a MAP_PRIVATE ro mapping ?
>
> It was working in previous kernels as well.
>
> We can say its stupid, but IMHO its not.
>
> In other words, this program should work, if process never touches
> (writes) into first page.
>
> This program on previous kernels gave :
> rc=-1 errno=11
> (allowing to wait for a value change and a futex_WAKE)
>
> With new kernel :
> rc=-1 errno=14 [ no sleep allowed ]
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> typedef uint32_t u32; // for futex.h
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
>
> int main() {
> int fd, *futex, rc;
>
> fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> write(fd, "\1\1\1\1", 4);
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, 42, 0, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-07 03:49:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le lundi 06 juin 2011 à 20:13 -0700, Darren Hart a écrit :
>
> On 06/06/2011 11:11 AM, Eric Dumazet wrote:
> > Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
> >>
> >
> >> If I understand the problem correctly, RO private mapping really doesn't
> >> make any sense and we should probably explicitly not support it, while
> >> special casing the RO shared mapping in support of David's scenario.
> >>
> >
> > We supported them in 2.6.18 kernels, apparently. This might sounds
> > stupid but who knows ?
>
>
> I guess this is actually the key point we need to agree on to provide a
> solution. This particular case "worked" in 2.6.18 kernels, but that
> doesn't necessarily mean it was supported, or even intentional.
>
> It sounds to me that we agree that we should support RO shared mappings.
> The question remains about whether we should introduce deliberate
> support of RO private mappings, and if so, if the forced COW approach is
> appropriate or not.
>
> Does anyone with a longer history working with futexes than I have an
> opinion on this? Is support for RO private mappings part of our futex
> API, or was it an unintentional side effect of the futex simply being a
> userspace address.
>

I personnally dont care as I dont use ro mappings for my futexes land,
but I can feel the pain of people discovering yet another
incompatibility in their user apps after a kernel upgrade, spending so
much time to find the root of the problem (hey, not everybody is a
kernel hacker)

If we think about it, futex_wait() should not touch memory, only read
it. Some smart layer could be upset by this (valgrind ?)

Its like saying write(int fd, const void *buffer, size_t count) could
try to do a COW on buffer, because it makes kernel programmer life more
comfortable, this makes litle sense to me IMHO.

Part of the problem comes from futex() syscall being a multiplexor.
What a mess.

If we had a clean API at the beginning, then we would have :

int sys_futex_wait(const void *futex, int val, const struct timespec *t);

And really, doing COW in futex_wait() would clearly be wrong.


2011-06-07 14:44:32

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On 06/06/2011 11:13 PM, Darren Hart wrote:
>
>
> On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>>>
>>
>>> If I understand the problem correctly, RO private mapping really doesn't
>>> make any sense and we should probably explicitly not support it, while
>>> special casing the RO shared mapping in support of David's scenario.
>>>
>>
>> We supported them in 2.6.18 kernels, apparently. This might sounds
>> stupid but who knows ?
>
>
> I guess this is actually the key point we need to agree on to provide a
> solution. This particular case "worked" in 2.6.18 kernels, but that
> doesn't necessarily mean it was supported, or even intentional.
>
> It sounds to me that we agree that we should support RO shared mappings.
> The question remains about whether we should introduce deliberate
> support of RO private mappings, and if so, if the forced COW approach is
> appropriate or not.
>

I disagree.

FUTEX_WAIT has side-effects. Specifically, it eats one wakeup sent by
FUTEX_WAKE. So if something uses futexes on a file mapping, then a
process with only read access could (if the semantics were changed) DoS
the other processes by spawning a bunch of threads and FUTEX_WAITing
from each of them.

If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
worked on RO mappings, I would drop my objection.

--Andy

2011-06-07 15:56:37

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/07/2011 07:44 AM, Andy Lutomirski wrote:
> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>
>>
>> On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>>>>
>>>
>>>> If I understand the problem correctly, RO private mapping really doesn't
>>>> make any sense and we should probably explicitly not support it, while
>>>> special casing the RO shared mapping in support of David's scenario.
>>>>
>>>
>>> We supported them in 2.6.18 kernels, apparently. This might sounds
>>> stupid but who knows ?
>>
>>
>> I guess this is actually the key point we need to agree on to provide a
>> solution. This particular case "worked" in 2.6.18 kernels, but that
>> doesn't necessarily mean it was supported, or even intentional.
>>
>> It sounds to me that we agree that we should support RO shared mappings.
>> The question remains about whether we should introduce deliberate
>> support of RO private mappings, and if so, if the forced COW approach is
>> appropriate or not.
>>
>
> I disagree.
>
> FUTEX_WAIT has side-effects. Specifically, it eats one wakeup sent by
> FUTEX_WAKE. So if something uses futexes on a file mapping, then a
> process with only read access could (if the semantics were changed) DoS
> the other processes by spawning a bunch of threads and FUTEX_WAITing
> from each of them.
>
> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
> worked on RO mappings, I would drop my objection.


This sounds like an argument for properly managing file permissions and
carefully selecting the mapping backing your futex word - but I don't
see this as compelling rationale to disable RO support entirely and
certainly not to add yet another futex op code.


--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-07 15:58:26

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le mardi 07 juin 2011 à 10:44 -0400, Andy Lutomirski a écrit :
> On 06/06/2011 11:13 PM, Darren Hart wrote:
> >
> >
> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
> >> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
> >>>
> >>
> >>> If I understand the problem correctly, RO private mapping really doesn't
> >>> make any sense and we should probably explicitly not support it, while
> >>> special casing the RO shared mapping in support of David's scenario.
> >>>
> >>
> >> We supported them in 2.6.18 kernels, apparently. This might sounds
> >> stupid but who knows ?
> >
> >
> > I guess this is actually the key point we need to agree on to provide a
> > solution. This particular case "worked" in 2.6.18 kernels, but that
> > doesn't necessarily mean it was supported, or even intentional.
> >
> > It sounds to me that we agree that we should support RO shared mappings.
> > The question remains about whether we should introduce deliberate
> > support of RO private mappings, and if so, if the forced COW approach is
> > appropriate or not.
> >
>
> I disagree.
>
> FUTEX_WAIT has side-effects. Specifically, it eats one wakeup sent by
> FUTEX_WAKE. So if something uses futexes on a file mapping, then a
> process with only read access could (if the semantics were changed) DoS
> the other processes by spawning a bunch of threads and FUTEX_WAITing
> from each of them.
>
> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
> worked on RO mappings, I would drop my objection.

If a group of cooperating processes uses a memory segment to exchange
critical information, do you really think this memory segment will be
readable by other unrelated processes on the machine ?

How is this related to futex code ?

Same problem for legacy IPC (shm, msg, sem) : Appropriate protections
are needed, obviously.

BTW, kernel/futex.c uses a global hash table (futex_queues[256]) and a
very predictable hash_futex(), so its easy to slow down futex users...


2011-06-07 18:31:29

by Joel Becker

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, Jun 06, 2011 at 08:11:38PM +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
> >
>
> > If I understand the problem correctly, RO private mapping really doesn't
> > make any sense and we should probably explicitly not support it, while
> > special casing the RO shared mapping in support of David's scenario.
> >
>
> We supported them in 2.6.18 kernels, apparently. This might sounds
> stupid but who knows ?

Trying to come up with a strawman for this sort of operation.
What about a process that creates a private mapping and then creates
threads with CLONE_VM? Would we CoW in that case?

Joel

--

Pitchers and catchers report.

http://www.jlbec.org/
[email protected]

2011-06-07 18:43:27

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>> >
>> >
>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>> >> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>> >>>
>> >>
>> >>> If I understand the problem correctly, RO private mapping really doesn't
>> >>> make any sense and we should probably explicitly not support it, while
>> >>> special casing the RO shared mapping in support of David's scenario.
>> >>>
>> >>
>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>> >> stupid but who knows ?
>> >
>> >
>> > I guess this is actually the key point we need to agree on to provide a
>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>> > doesn't necessarily mean it was supported, or even intentional.
>> >
>> > It sounds to me that we agree that we should support RO shared mappings.
>> > The question remains about whether we should introduce deliberate
>> > support of RO private mappings, and if so, if the forced COW approach is
>> > appropriate or not.
>> >
>>
>> I disagree.
>>
>> FUTEX_WAIT has side-effects. ?Specifically, it eats one wakeup sent by
>> FUTEX_WAKE. ?So if something uses futexes on a file mapping, then a
>> process with only read access could (if the semantics were changed) DoS
>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>> from each of them.
>>
>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>> worked on RO mappings, I would drop my objection.
>
> If a group of cooperating processes uses a memory segment to exchange
> critical information, do you really think this memory segment will be
> readable by other unrelated processes on the machine ?

Depends on the design.

I have some software I'm working on that uses shared files and could
easily use futexes. I don't want random read-only processes to
interfere with the futex protocol.

>
> How is this related to futex code ?

Because this usage is currently safe and would become unsafe with the
proposed change.

>
> Same problem for legacy IPC (shm, msg, sem) : Appropriate protections
> are needed, obviously.
>
> BTW, kernel/futex.c uses a global hash table (futex_queues[256]) and a
> very predictable hash_futex(), so its easy to slow down futex users...

There's a difference between slowing down users by abusing a kernel
hash and deadlocking users by eating a wakeup. (If you eat a wakeup
the wakeup won't magically come back later. It's gone.)

--Andy

2011-06-07 19:01:38

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/07/2011 11:43 AM, Andrew Lutomirski wrote:
> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>>
>>>>
>>>> On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>>> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>>>>>>
>>>>>
>>>>>> If I understand the problem correctly, RO private mapping really doesn't
>>>>>> make any sense and we should probably explicitly not support it, while
>>>>>> special casing the RO shared mapping in support of David's scenario.
>>>>>>
>>>>>
>>>>> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>>> stupid but who knows ?
>>>>
>>>>
>>>> I guess this is actually the key point we need to agree on to provide a
>>>> solution. This particular case "worked" in 2.6.18 kernels, but that
>>>> doesn't necessarily mean it was supported, or even intentional.
>>>>
>>>> It sounds to me that we agree that we should support RO shared mappings.
>>>> The question remains about whether we should introduce deliberate
>>>> support of RO private mappings, and if so, if the forced COW approach is
>>>> appropriate or not.
>>>>
>>>
>>> I disagree.
>>>
>>> FUTEX_WAIT has side-effects. Specifically, it eats one wakeup sent by
>>> FUTEX_WAKE. So if something uses futexes on a file mapping, then a
>>> process with only read access could (if the semantics were changed) DoS
>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>> from each of them.
>>>
>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>> worked on RO mappings, I would drop my objection.
>>
>> If a group of cooperating processes uses a memory segment to exchange
>> critical information, do you really think this memory segment will be
>> readable by other unrelated processes on the machine ?
>
> Depends on the design.
>
> I have some software I'm working on that uses shared files and could
> easily use futexes. I don't want random read-only processes to
> interfere with the futex protocol.


So don't use world readable files.


>>
>> How is this related to futex code ?
>
> Because this usage is currently safe and would become unsafe with the
> proposed change.
>
>>
>> Same problem for legacy IPC (shm, msg, sem) : Appropriate protections
>> are needed, obviously.
>>
>> BTW, kernel/futex.c uses a global hash table (futex_queues[256]) and a
>> very predictable hash_futex(), so its easy to slow down futex users...
>
> There's a difference between slowing down users by abusing a kernel
> hash and deadlocking users by eating a wakeup. (If you eat a wakeup
> the wakeup won't magically come back later. It's gone.)

That's the nature of SHARED, you have to protect the mapping independent
of the futex mechanism.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-07 19:04:24

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 3:01 PM, Darren Hart <[email protected]> wrote:
>>> If a group of cooperating processes uses a memory segment to exchange
>>> critical information, do you really think this memory segment will be
>>> readable by other unrelated processes on the machine ?
>>
>> Depends on the design.
>>
>> I have some software I'm working on that uses shared files and could
>> easily use futexes. ?I don't want random read-only processes to
>> interfere with the futex protocol.
>
>
> So don't use world readable files.

...which prevents people from *reading* them, which was the whole point.

>
>
>>>
>>> How is this related to futex code ?
>>
>> Because this usage is currently safe and would become unsafe with the
>> proposed change.
>>
>>>
>>> Same problem for legacy IPC (shm, msg, sem) : Appropriate protections
>>> are needed, obviously.
>>>
>>> BTW, kernel/futex.c uses a global hash table (futex_queues[256]) and a
>>> very predictable hash_futex(), so its easy to slow down futex users...
>>
>> There's a difference between slowing down users by abusing a kernel
>> hash and deadlocking users by eating a wakeup. ?(If you eat a wakeup
>> the wakeup won't magically come back later. ?It's gone.)
>
> That's the nature of SHARED, you have to protect the mapping independent
> of the futex mechanism.

Well... it used to mean you have to protect from untrusted RW users.
Now it will mean you have to protect from untrusted RO users.

AFAICT sys_futex will become the only way that a user with RO access
to a file can actually interfere with the owner of the file (as
opposed to just learning information).

Why do we need this change again?

--Andy

2011-06-07 19:06:27

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le mardi 07 juin 2011 à 14:43 -0400, Andrew Lutomirski a écrit :

> I have some software I'm working on that uses shared files and could
> easily use futexes. I don't want random read-only processes to
> interfere with the futex protocol.

Relying on kernel doing 'the right thing for you' wont work on old
kernels. Some people still want to run 2.6.18 ones, for very good
reasons. Also this 'magical protection' is not documented on man pages,
so this is would be a bad choice.

Regression being one year old, you can bet many machines run with
previous behavior (safe too, unless your application design is not
secured)

> There's a difference between slowing down users by abusing a kernel
> hash and deadlocking users by eating a wakeup. (If you eat a wakeup
> the wakeup won't magically come back later. It's gone.)

Sure, you could let messages queues, named pipes, being readable as
well. Or your files being writeable, who knows, and lost your data too.

I dont know why I even discuss the point. A regression was added, you
cant justify it saying futexes were insecure from 2002 to 2010.


2011-06-07 19:10:45

by David C. Oliver

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>> >
>>> >
>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>> >> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>>> >>>
>>> >>
>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>> >>> make any sense and we should probably explicitly not support it, while
>>> >>> special casing the RO shared mapping in support of David's scenario.
>>> >>>
>>> >>
>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>> >> stupid but who knows ?
>>> >
>>> >
>>> > I guess this is actually the key point we need to agree on to provide a
>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>> > doesn't necessarily mean it was supported, or even intentional.
>>> >
>>> > It sounds to me that we agree that we should support RO shared mappings.
>>> > The question remains about whether we should introduce deliberate
>>> > support of RO private mappings, and if so, if the forced COW approach is
>>> > appropriate or not.
>>> >
>>>
>>> I disagree.
>>>
>>> FUTEX_WAIT has side-effects. ?Specifically, it eats one wakeup sent by
>>> FUTEX_WAKE. ?So if something uses futexes on a file mapping, then a
>>> process with only read access could (if the semantics were changed) DoS
>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>> from each of them.
>>>
>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>> worked on RO mappings, I would drop my objection.
>>
>> If a group of cooperating processes uses a memory segment to exchange
>> critical information, do you really think this memory segment will be
>> readable by other unrelated processes on the machine ?
>
> Depends on the design.
>
> I have some software I'm working on that uses shared files and could
> easily use futexes.
>
I have software which currently uses shared files for a one way
transfer of information, which is modeled precisely by the futex (as
contrasted to the mutex) model. In this case, the number of receivers
is undetermined, so the number of wakeups is set to maxint.

The receivers are minimally trusted: they have read access to the
files, so they cannot accidentally affect other processes use of the
data. Requiring my files to be writeable by all clients would require
a serious increase in the amount of software needing to be trusted.

>?I don't want random read-only processes to
> interfere with the futex protocol.
>
If limited wakeups are needed for your application, this can be
prevented by preventing access to the files. This is analogous to
preventing read access to fifos: reading a random unprotected named
pipe would have interesting consequences.

>> How is this related to futex code ?
>
> Because this usage is currently safe and would become unsafe with the
> proposed change.
>
My usage was both safe and useful before the recent kernel change.
>>
>> Same problem for legacy IPC (shm, msg, sem) : Appropriate protections
>> are needed, obviously.
>>
>> BTW, kernel/futex.c uses a global hash table (futex_queues[256]) and a
>> very predictable hash_futex(), so its easy to slow down futex users...
>
> There's a difference between slowing down users by abusing a kernel
> hash and deadlocking users by eating a wakeup. ?(If you eat a wakeup
> the wakeup won't magically come back later. ?It's gone.)
>
> --Andy
>

--
Cheers!

David.


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-07 19:19:55

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>>> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>> >
>>>> >
>>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>> >> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>>>> >>>
>>>> >>
>>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>>> >>> make any sense and we should probably explicitly not support it, while
>>>> >>> special casing the RO shared mapping in support of David's scenario.
>>>> >>>
>>>> >>
>>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>> >> stupid but who knows ?
>>>> >
>>>> >
>>>> > I guess this is actually the key point we need to agree on to provide a
>>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>>> > doesn't necessarily mean it was supported, or even intentional.
>>>> >
>>>> > It sounds to me that we agree that we should support RO shared mappings.
>>>> > The question remains about whether we should introduce deliberate
>>>> > support of RO private mappings, and if so, if the forced COW approach is
>>>> > appropriate or not.
>>>> >
>>>>
>>>> I disagree.
>>>>
>>>> FUTEX_WAIT has side-effects. ?Specifically, it eats one wakeup sent by
>>>> FUTEX_WAKE. ?So if something uses futexes on a file mapping, then a
>>>> process with only read access could (if the semantics were changed) DoS
>>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>>> from each of them.
>>>>
>>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>>> worked on RO mappings, I would drop my objection.
>>>
>>> If a group of cooperating processes uses a memory segment to exchange
>>> critical information, do you really think this memory segment will be
>>> readable by other unrelated processes on the machine ?
>>
>> Depends on the design.
>>
>> I have some software I'm working on that uses shared files and could
>> easily use futexes.
>>
> I have software which currently uses shared files for a one way
> transfer of information, which is modeled precisely by the futex (as
> contrasted to the mutex) model. In this case, the number of receivers
> is undetermined, so the number of wakeups is set to maxint.
>
> The receivers are minimally trusted: they have read access to the
> files, so they cannot accidentally affect other processes use of the
> data. Requiring my files to be writeable by all clients would require
> a serious increase in the amount of software needing to be trusted.

What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then? Your
program can use it to get exactly the semantics it wants and my
program can use it or not depending on which semantics it wants.

Then we can document in the man page that, on kernels newer than
whichever version introduced the regression, read-only mappings of a
file cannot be used to interfere with futexes on that file.

--Andy

2011-06-07 19:33:55

by David C. Oliver

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 2:19 PM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
>>> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>>>> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>>>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>>> >
>>>>> >
>>>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>>> >> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>>>>> >>>
>>>>> >>
>>>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>>>> >>> make any sense and we should probably explicitly not support it, while
>>>>> >>> special casing the RO shared mapping in support of David's scenario.
>>>>> >>>
>>>>> >>
>>>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>>> >> stupid but who knows ?
>>>>> >
>>>>> >
>>>>> > I guess this is actually the key point we need to agree on to provide a
>>>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>>>> > doesn't necessarily mean it was supported, or even intentional.
>>>>> >
>>>>> > It sounds to me that we agree that we should support RO shared mappings.
>>>>> > The question remains about whether we should introduce deliberate
>>>>> > support of RO private mappings, and if so, if the forced COW approach is
>>>>> > appropriate or not.
>>>>> >
>>>>>
>>>>> I disagree.
>>>>>
>>>>> FUTEX_WAIT has side-effects. ?Specifically, it eats one wakeup sent by
>>>>> FUTEX_WAKE. ?So if something uses futexes on a file mapping, then a
>>>>> process with only read access could (if the semantics were changed) DoS
>>>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>>>> from each of them.
>>>>>
>>>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>>>> worked on RO mappings, I would drop my objection.
>>>>
>>>> If a group of cooperating processes uses a memory segment to exchange
>>>> critical information, do you really think this memory segment will be
>>>> readable by other unrelated processes on the machine ?
>>>
>>> Depends on the design.
>>>
>>> I have some software I'm working on that uses shared files and could
>>> easily use futexes.
>>>
>> I have software which currently uses shared files for a one way
>> transfer of information, which is modeled precisely by the futex (as
>> contrasted to the mutex) model. In this case, the number of receivers
>> is undetermined, so the number of wakeups is set to maxint.
>>
>> The receivers are minimally trusted: they have read access to the
>> files, so they cannot accidentally affect other processes use of the
>> data. Requiring my files to be writeable by all clients would require
>> a serious increase in the amount of software needing to be trusted.
>
> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then? ?Your
> program can use it to get exactly the semantics it wants and my
> program can use it or not depending on which semantics it wants.
>
1. I would prefer not to require my programs have to check for kernel
version (code named "working", "regressed", and "altered") to decide
which parameters need to be sent to the futex call.
2. Doing FUTEX_WAIT_NOCONSUME would change the semantics of
futex_wake() between the "working" and "altered" kernels, as it would
no longer return the number of processes woken.

It seems that FUTEX_WAIT_NOCONSUME would be rather like a
non-consuming read on a pipe.

> Then we can document in the man page that, on kernels newer than
> whichever version introduced the regression, read-only mappings of a
> file cannot be used to interfere with futexes on that file.
>
> --Andy
>

--
Cheers!

David.


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-07 19:53:35

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 3:33 PM, David Oliver <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 2:19 PM, Andrew Lutomirski <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>>> On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
>>>> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>>>>> Le mardi 07 juin 2011 ? 10:44 -0400, Andy Lutomirski a ?crit :
>>>>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>>>> >
>>>>>> >
>>>>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>>>> >> Le lundi 06 juin 2011 ? 10:53 -0700, Darren Hart a ?crit :
>>>>>> >>>
>>>>>> >>
>>>>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>>>>> >>> make any sense and we should probably explicitly not support it, while
>>>>>> >>> special casing the RO shared mapping in support of David's scenario.
>>>>>> >>>
>>>>>> >>
>>>>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>>>> >> stupid but who knows ?
>>>>>> >
>>>>>> >
>>>>>> > I guess this is actually the key point we need to agree on to provide a
>>>>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>>>>> > doesn't necessarily mean it was supported, or even intentional.
>>>>>> >
>>>>>> > It sounds to me that we agree that we should support RO shared mappings.
>>>>>> > The question remains about whether we should introduce deliberate
>>>>>> > support of RO private mappings, and if so, if the forced COW approach is
>>>>>> > appropriate or not.
>>>>>> >
>>>>>>
>>>>>> I disagree.
>>>>>>
>>>>>> FUTEX_WAIT has side-effects. ?Specifically, it eats one wakeup sent by
>>>>>> FUTEX_WAKE. ?So if something uses futexes on a file mapping, then a
>>>>>> process with only read access could (if the semantics were changed) DoS
>>>>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>>>>> from each of them.
>>>>>>
>>>>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>>>>> worked on RO mappings, I would drop my objection.
>>>>>
>>>>> If a group of cooperating processes uses a memory segment to exchange
>>>>> critical information, do you really think this memory segment will be
>>>>> readable by other unrelated processes on the machine ?
>>>>
>>>> Depends on the design.
>>>>
>>>> I have some software I'm working on that uses shared files and could
>>>> easily use futexes.
>>>>
>>> I have software which currently uses shared files for a one way
>>> transfer of information, which is modeled precisely by the futex (as
>>> contrasted to the mutex) model. In this case, the number of receivers
>>> is undetermined, so the number of wakeups is set to maxint.
>>>
>>> The receivers are minimally trusted: they have read access to the
>>> files, so they cannot accidentally affect other processes use of the
>>> data. Requiring my files to be writeable by all clients would require
>>> a serious increase in the amount of software needing to be trusted.
>>
>> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then? ?Your
>> program can use it to get exactly the semantics it wants and my
>> program can use it or not depending on which semantics it wants.
>>
> 1. I would prefer not to require my programs have to check for kernel
> version (code named "working", "regressed", and "altered") to decide
> which parameters need to be sent to the futex call.

You don't have to check for kernel version. Just try
FUTEX_WAIT_NOCONSUME first and retry with FUTEX_WAIT if it returns
-EINVAL.

I think you've already lost on regressed kernels regardless :-/

> 2. Doing FUTEX_WAIT_NOCONSUME would change the semantics of
> futex_wake() between the "working" and "altered" kernels, as it would
> no longer return the number of processes woken.

True, but that change couldn't affect old code because old code
wouldn't use FUTEX_WAIT_NOCONSUME.

>
> It seems that FUTEX_WAIT_NOCONSUME would be rather like a
> non-consuming read on a pipe.

More like a nonconsuming read on an eventfd, which sounds very useful.
(Actually, I'm porting code from Windows to Linux right now that
wants that feature...)

The reason I bring this up now is that I've been annoyed that
FUTEX_WAIT can be used on an R/O mapping to interfere with futexes in
that mapping. Under the original semantics this would have been
pretty much impossible to fix, but the regression has been there for
long enough that we have the option right now to fix it better instead
of restoring the original behavior.


--Andy

2011-06-07 20:04:30

by David C. Oliver

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 2:53 PM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 3:33 PM, David Oliver <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 2:19 PM, Andrew Lutomirski <[email protected]> wrote:
>>> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>>>> On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
>>>>> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>>>>>> Le mardi 07 juin 2011 à 10:44 -0400, Andy Lutomirski a écrit :
>>>>>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>>>>> >> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>>>>>>> >>>
>>>>>>> >>
>>>>>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>>>>>> >>> make any sense and we should probably explicitly not support it, while
>>>>>>> >>> special casing the RO shared mapping in support of David's scenario.
>>>>>>> >>>
>>>>>>> >>
>>>>>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>>>>> >> stupid but who knows ?
>>>>>>> >
>>>>>>> >
>>>>>>> > I guess this is actually the key point we need to agree on to provide a
>>>>>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>>>>>> > doesn't necessarily mean it was supported, or even intentional.
>>>>>>> >
>>>>>>> > It sounds to me that we agree that we should support RO shared mappings.
>>>>>>> > The question remains about whether we should introduce deliberate
>>>>>>> > support of RO private mappings, and if so, if the forced COW approach is
>>>>>>> > appropriate or not.
>>>>>>> >
>>>>>>>
>>>>>>> I disagree.
>>>>>>>
>>>>>>> FUTEX_WAIT has side-effects.  Specifically, it eats one wakeup sent by
>>>>>>> FUTEX_WAKE.  So if something uses futexes on a file mapping, then a
>>>>>>> process with only read access could (if the semantics were changed) DoS
>>>>>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>>>>>> from each of them.
>>>>>>>
>>>>>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>>>>>> worked on RO mappings, I would drop my objection.
>>>>>>
>>>>>> If a group of cooperating processes uses a memory segment to exchange
>>>>>> critical information, do you really think this memory segment will be
>>>>>> readable by other unrelated processes on the machine ?
>>>>>
>>>>> Depends on the design.
>>>>>
>>>>> I have some software I'm working on that uses shared files and could
>>>>> easily use futexes.
>>>>>
>>>> I have software which currently uses shared files for a one way
>>>> transfer of information, which is modeled precisely by the futex (as
>>>> contrasted to the mutex) model. In this case, the number of receivers
>>>> is undetermined, so the number of wakeups is set to maxint.
>>>>
>>>> The receivers are minimally trusted: they have read access to the
>>>> files, so they cannot accidentally affect other processes use of the
>>>> data. Requiring my files to be writeable by all clients would require
>>>> a serious increase in the amount of software needing to be trusted.
>>>
>>> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then?  Your
>>> program can use it to get exactly the semantics it wants and my
>>> program can use it or not depending on which semantics it wants.
>>>
>> 1. I would prefer not to require my programs have to check for kernel
>> version (code named "working", "regressed", and "altered") to decide
>> which parameters need to be sent to the futex call.
>
> You don't have to check for kernel version.  Just try
> FUTEX_WAIT_NOCONSUME first and retry with FUTEX_WAIT if it returns
> -EINVAL.
>
... and punt if that gives me an EFAULT. Possible but clumsy.
Fortunately, I'm not writing code for general consumption.

> I think you've already lost on regressed kernels regardless :-/
>
>> 2. Doing FUTEX_WAIT_NOCONSUME would change the semantics of
>> futex_wake() between the "working" and "altered" kernels, as it would
>> no longer return the number of processes woken.
>
> True, but that change couldn't affect old code because old code
> wouldn't use FUTEX_WAIT_NOCONSUME.
>
So, how would I find out the number of processes awakened by the
futex_wake() - I only care for statistical purposes.

>>
>> It seems that FUTEX_WAIT_NOCONSUME would be rather like a
>> non-consuming read on a pipe.
>
> More like a nonconsuming read on an eventfd, which sounds very useful.
>  (Actually, I'm porting code from Windows to Linux right now that
> wants that feature...)
>
> The reason I bring this up now is that I've been annoyed that
> FUTEX_WAIT can be used on an R/O mapping to interfere with futexes in
> that mapping.  Under the original semantics this would have been
> pretty much impossible to fix, but the regression has been there for
> long enough that we have the option right now to fix it better instead
> of restoring the original behavior.
>
Not being a kernel developer, the change seems very recent - about
when I started finding my code failing with EFAULTs.

>From my perspective, that's a real case of my futexes being interfered with :).
>
> --Andy
>

--
Cheers!

David.


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-07 20:12:50

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 4:04 PM, David Oliver <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 2:53 PM, Andrew Lutomirski <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 3:33 PM, David Oliver <[email protected]> wrote:
>>> On Tue, Jun 7, 2011 at 2:19 PM, Andrew Lutomirski <[email protected]> wrote:
>>>> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>>>>> On Tue, Jun 7, 2011 at 1:43 PM, Andrew Lutomirski <[email protected]> wrote:
>>>>>> On Tue, Jun 7, 2011 at 11:58 AM, Eric Dumazet <[email protected]> wrote:
>>>>>>> Le mardi 07 juin 2011 à 10:44 -0400, Andy Lutomirski a écrit :
>>>>>>>> On 06/06/2011 11:13 PM, Darren Hart wrote:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On 06/06/2011 11:11 AM, Eric Dumazet wrote:
>>>>>>>> >> Le lundi 06 juin 2011 à 10:53 -0700, Darren Hart a écrit :
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >>> If I understand the problem correctly, RO private mapping really doesn't
>>>>>>>> >>> make any sense and we should probably explicitly not support it, while
>>>>>>>> >>> special casing the RO shared mapping in support of David's scenario.
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >> We supported them in 2.6.18 kernels, apparently. This might sounds
>>>>>>>> >> stupid but who knows ?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I guess this is actually the key point we need to agree on to provide a
>>>>>>>> > solution. This particular case "worked" in 2.6.18 kernels, but that
>>>>>>>> > doesn't necessarily mean it was supported, or even intentional.
>>>>>>>> >
>>>>>>>> > It sounds to me that we agree that we should support RO shared mappings.
>>>>>>>> > The question remains about whether we should introduce deliberate
>>>>>>>> > support of RO private mappings, and if so, if the forced COW approach is
>>>>>>>> > appropriate or not.
>>>>>>>> >
>>>>>>>>
>>>>>>>> I disagree.
>>>>>>>>
>>>>>>>> FUTEX_WAIT has side-effects.  Specifically, it eats one wakeup sent by
>>>>>>>> FUTEX_WAKE.  So if something uses futexes on a file mapping, then a
>>>>>>>> process with only read access could (if the semantics were changed) DoS
>>>>>>>> the other processes by spawning a bunch of threads and FUTEX_WAITing
>>>>>>>> from each of them.
>>>>>>>>
>>>>>>>> If there were a FUTEX_WAIT_NOCONSUME that did not consume a wakeup and
>>>>>>>> worked on RO mappings, I would drop my objection.
>>>>>>>
>>>>>>> If a group of cooperating processes uses a memory segment to exchange
>>>>>>> critical information, do you really think this memory segment will be
>>>>>>> readable by other unrelated processes on the machine ?
>>>>>>
>>>>>> Depends on the design.
>>>>>>
>>>>>> I have some software I'm working on that uses shared files and could
>>>>>> easily use futexes.
>>>>>>
>>>>> I have software which currently uses shared files for a one way
>>>>> transfer of information, which is modeled precisely by the futex (as
>>>>> contrasted to the mutex) model. In this case, the number of receivers
>>>>> is undetermined, so the number of wakeups is set to maxint.
>>>>>
>>>>> The receivers are minimally trusted: they have read access to the
>>>>> files, so they cannot accidentally affect other processes use of the
>>>>> data. Requiring my files to be writeable by all clients would require
>>>>> a serious increase in the amount of software needing to be trusted.
>>>>
>>>> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then?  Your
>>>> program can use it to get exactly the semantics it wants and my
>>>> program can use it or not depending on which semantics it wants.
>>>>
>>> 1. I would prefer not to require my programs have to check for kernel
>>> version (code named "working", "regressed", and "altered") to decide
>>> which parameters need to be sent to the futex call.
>>
>> You don't have to check for kernel version.  Just try
>> FUTEX_WAIT_NOCONSUME first and retry with FUTEX_WAIT if it returns
>> -EINVAL.
>>
> ... and punt if that gives me an EFAULT. Possible but clumsy.
> Fortunately, I'm not writing code for general consumption.
>
>> I think you've already lost on regressed kernels regardless :-/
>>
>>> 2. Doing FUTEX_WAIT_NOCONSUME would change the semantics of
>>> futex_wake() between the "working" and "altered" kernels, as it would
>>> no longer return the number of processes woken.
>>
>> True, but that change couldn't affect old code because old code
>> wouldn't use FUTEX_WAIT_NOCONSUME.
>>
> So, how would I find out the number of processes awakened by the
> futex_wake() - I only care for statistical purposes.

Add a FUTEX_WAKE_COUNT_NOCONSUME or some such magic flag. Yeah, not so pretty.

>
>>>
>>> It seems that FUTEX_WAIT_NOCONSUME would be rather like a
>>> non-consuming read on a pipe.
>>
>> More like a nonconsuming read on an eventfd, which sounds very useful.
>>  (Actually, I'm porting code from Windows to Linux right now that
>> wants that feature...)
>>
>> The reason I bring this up now is that I've been annoyed that
>> FUTEX_WAIT can be used on an R/O mapping to interfere with futexes in
>> that mapping.  Under the original semantics this would have been
>> pretty much impossible to fix, but the regression has been there for
>> long enough that we have the option right now to fix it better instead
>> of restoring the original behavior.
>>
> Not being a kernel developer, the change seems very recent - about
> when I started finding my code failing with EFAULTs.
>
> From my perspective, that's a real case of my futexes being interfered with :).

Fair enough. But it's a little late to prevent the regression.

--Andy

2011-06-07 22:26:33

by Kyle Moffett

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 15:19, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>> I have software which currently uses shared files for a one way
>> transfer of information, which is modeled precisely by the futex (as
>> contrasted to the mutex) model. In this case, the number of receivers
>> is undetermined, so the number of wakeups is set to maxint.
>>
>> The receivers are minimally trusted: they have read access to the
>> files, so they cannot accidentally affect other processes use of the
>> data. Requiring my files to be writeable by all clients would require
>> a serious increase in the amount of software needing to be trusted.
>
> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then?  Your
> program can use it to get exactly the semantics it wants and my
> program can use it or not depending on which semantics it wants.
>
> Then we can document in the man page that, on kernels newer than
> whichever version introduced the regression, read-only mappings of a
> file cannot be used to interfere with futexes on that file.

Hmm, I would actually call it "FUTEX_POLL", since that better reflects the
operation being performed.

Certainly you would want to avoid allowing FUTEX_POLL to "steal"
limited wakeups from FUTEX_CMP_REQUEUE or whatever, so you
also need a new "FUTEX_NOTIFY". Alternatively I guess you could just
special-case the FUTEX_WAKE && wakeups == INTMAX combination to
also notify FUTEX_POLL processes.

I almost wonder if long-term there might possibly be some decent way
to integrate this with eventfds to allow a thread to wait for notifications from
any number of memory addresses as well as other event sources. This
would be a similar extension to signalfd, only for futexes.

Cheers,
Kyle Moffett

2011-06-08 15:20:19

by David C. Oliver

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Tue, Jun 7, 2011 at 5:26 PM, Kyle Moffett <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 15:19, Andrew Lutomirski <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>>> I have software which currently uses shared files for a one way
>>> transfer of information, which is modeled precisely by the futex (as
>>> contrasted to the mutex) model. In this case, the number of receivers
>>> is undetermined, so the number of wakeups is set to maxint.
>>>
>>> The receivers are minimally trusted: they have read access to the
>>> files, so they cannot accidentally affect other processes use of the
>>> data. Requiring my files to be writeable by all clients would require
>>> a serious increase in the amount of software needing to be trusted.
>>
>> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then? ?Your
>> program can use it to get exactly the semantics it wants and my
>> program can use it or not depending on which semantics it wants.
>>
>> Then we can document in the man page that, on kernels newer than
>> whichever version introduced the regression, read-only mappings of a
>> file cannot be used to interfere with futexes on that file.
>
> Hmm, I would actually call it "FUTEX_POLL", since that better reflects the
> operation being performed.
>
> Certainly you would want to avoid allowing FUTEX_POLL to "steal"
> limited wakeups from FUTEX_CMP_REQUEUE or whatever, so you
> also need a new "FUTEX_NOTIFY". ?Alternatively I guess you could just
> special-case the FUTEX_WAKE && wakeups == INTMAX combination to
> also notify FUTEX_POLL processes.
>
> I almost wonder if long-term there might possibly be some decent way
> to integrate this with eventfds to allow a thread to wait for notifications from
> any number of memory addresses as well as other event sources. ?This
> would be a similar extension to signalfd, only for futexes.
>
> Cheers,
> Kyle Moffett
>

Having a new call is inelegant from a futex(2) user perspective, as
the need for a change is due to the kernel implementation and/or mutex
requirements. The futex() system call, as documented, is ideal for a
single producer to signal multiple receivers of state updates.

If it is truly necessary to add new variants to futex() to protect
applications that allow untrusted applications read access to their
mutexes, I would avoid both the names suggested, as consumption of
wakeups is not an obvious issue to users, and POLL suggests waiting
for multiple entities as in poll(2) (which is not provided), or
returning immediately (which is orthogonally provided by the timeout
parameter). What is being provided from the user point of view is a
FUTEX_WAIT per the man page, which doesn't require write access. How
about FUTEX_WAIT_RDONLY?

Alternatively, use the current call and document that when process
performing a FUTEX_WAIT on read-only memory are woken, they do not
count towards the number reported as being woken.

Best, IMHO, would be to document that providing read access to mutexes
to untrusted software is unsafe behavior, and restore read only access
to readers of futexes.

--
Cheers!

David.


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-08 15:22:21

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Wed, Jun 8, 2011 at 11:20 AM, David Oliver <[email protected]> wrote:
>
> Having a new call is inelegant from a futex(2) user perspective, as
> the need for a change is due to the kernel implementation and/or mutex
> requirements. The futex() system call, as documented, is ideal for a
> single producer to signal multiple receivers of state updates.
>
> If it is truly necessary to add new variants to futex() to protect
> applications that allow untrusted applications read access to their
> mutexes, I would avoid both the names suggested, as consumption of
> wakeups is not an obvious issue to users, and POLL suggests waiting
> for multiple entities as in poll(2) (which is not provided), or
> returning immediately (which is orthogonally provided by the timeout
> parameter). What is being provided from the user point of view is a
> FUTEX_WAIT per the man page, which doesn't require write access. How
> about FUTEX_WAIT_RDONLY?

That name sounds good.

>
> Alternatively, use the current call and document that when process
> performing a FUTEX_WAIT on read-only memory are woken, they do not
> count towards the number reported as being woken.

I don't see anything wrong with that, either.

>
> Best, IMHO, would be to document that providing read access to mutexes
> to untrusted software is unsafe behavior, and restore read only access
> to readers of futexes.

I think it should be safe, and it would be easy to make it be safe
(i.e. make no changes at all).

--Andy

2011-06-08 16:21:39

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/08/2011 08:20 AM, David Oliver wrote:
> On Tue, Jun 7, 2011 at 5:26 PM, Kyle Moffett <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 15:19, Andrew Lutomirski <[email protected]> wrote:
>>> On Tue, Jun 7, 2011 at 3:10 PM, David Oliver <[email protected]> wrote:
>>>> I have software which currently uses shared files for a one way
>>>> transfer of information, which is modeled precisely by the futex (as
>>>> contrasted to the mutex) model. In this case, the number of receivers
>>>> is undetermined, so the number of wakeups is set to maxint.
>>>>
>>>> The receivers are minimally trusted: they have read access to the
>>>> files, so they cannot accidentally affect other processes use of the
>>>> data. Requiring my files to be writeable by all clients would require
>>>> a serious increase in the amount of software needing to be trusted.
>>>
>>> What's wrong with adding a FUTEX_WAIT_NOCONSUME flag then? Your
>>> program can use it to get exactly the semantics it wants and my
>>> program can use it or not depending on which semantics it wants.
>>>
>>> Then we can document in the man page that, on kernels newer than
>>> whichever version introduced the regression, read-only mappings of a
>>> file cannot be used to interfere with futexes on that file.
>>
>> Hmm, I would actually call it "FUTEX_POLL", since that better reflects the
>> operation being performed.
>>
>> Certainly you would want to avoid allowing FUTEX_POLL to "steal"
>> limited wakeups from FUTEX_CMP_REQUEUE or whatever, so you
>> also need a new "FUTEX_NOTIFY". Alternatively I guess you could just
>> special-case the FUTEX_WAKE && wakeups == INTMAX combination to
>> also notify FUTEX_POLL processes.
>>
>> I almost wonder if long-term there might possibly be some decent way
>> to integrate this with eventfds to allow a thread to wait for notifications from
>> any number of memory addresses as well as other event sources. This
>> would be a similar extension to signalfd, only for futexes.
>>
>> Cheers,
>> Kyle Moffett
>>
>
> Having a new call is inelegant from a futex(2) user perspective, as
> the need for a change is due to the kernel implementation and/or mutex
> requirements. The futex() system call, as documented, is ideal for a
> single producer to signal multiple receivers of state updates.
>
> If it is truly necessary to add new variants to futex() to protect
> applications that allow untrusted applications read access to their
> mutexes, I would avoid both the names suggested, as consumption of
> wakeups is not an obvious issue to users, and POLL suggests waiting
> for multiple entities as in poll(2) (which is not provided), or
> returning immediately (which is orthogonally provided by the timeout
> parameter). What is being provided from the user point of view is a
> FUTEX_WAIT per the man page, which doesn't require write access. How
> about FUTEX_WAIT_RDONLY?
>
> Alternatively, use the current call and document that when process
> performing a FUTEX_WAIT on read-only memory are woken, they do not
> count towards the number reported as being woken.
>
> Best, IMHO, would be to document that providing read access to mutexes
> to untrusted software is unsafe behavior, and restore read only access
> to readers of futexes.

I'm inclined to agree with this approach.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-09 00:44:40

by George Spelvin

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

I'm not sure if it's best, but the risk of RO waiters interfering could
be solved by giving them a lower prioirty for wakeup and always waking
RW-mapped waiters first.

In most cases, there would only be one type of waiter, so this would
have no effect, but if there were, a RO-mapped reader couldn't "steal"
wakeup from someone who had it RW-mapped.

It might conflict with a fairness goal, but it does solve the problem
without any extension to the API.

2011-06-09 03:02:44

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On 06/08/2011 05:44 PM, George Spelvin wrote:
> I'm not sure if it's best, but the risk of RO waiters interfering could
> be solved by giving them a lower prioirty for wakeup and always waking
> RW-mapped waiters first.

This strikes me as bending over backwards and jumping through hoops
inside the kernel to avoid having to do proper permissions management in
userspace.


> In most cases, there would only be one type of waiter, so this would
> have no effect, but if there were, a RO-mapped reader couldn't "steal"
> wakeup from someone who had it RW-mapped.
>
> It might conflict with a fairness goal, but it does solve the problem
> without any extension to the API.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-09 03:39:08

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Wed, Jun 8, 2011 at 11:02 PM, Darren Hart <[email protected]> wrote:
> On 06/08/2011 05:44 PM, George Spelvin wrote:
>> I'm not sure if it's best, but the risk of RO waiters interfering could
>> be solved by giving them a lower prioirty for wakeup and always waking
>> RW-mapped waiters first.
>
> This strikes me as bending over backwards and jumping through hoops
> inside the kernel to avoid having to do proper permissions management in
> userspace.

Huh?

I still don't understand why userspace ought to need to deny read
access to a file to prevent DoS. I think it's entirely reasonable for
userspace to make the assumption that users with read access cannot
make changes visible to writers unless explicitly documented (i.e.
file locking, which is so thoroughly broken that it shouldn't be taken
as an example of how to design anything).

Given that current kernels make this use safe and the proposal is to
make it unsafe, I think it's worth designing the interface to avoid
introducing new security problems.

--Andy

2011-06-09 03:54:26

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le mercredi 08 juin 2011 à 23:38 -0400, Andrew Lutomirski a écrit :

> Huh?
>
> I still don't understand why userspace ought to need to deny read
> access to a file to prevent DoS. I think it's entirely reasonable for
> userspace to make the assumption that users with read access cannot
> make changes visible to writers unless explicitly documented (i.e.
> file locking, which is so thoroughly broken that it shouldn't be taken
> as an example of how to design anything).
>
> Given that current kernels make this use safe and the proposal is to
> make it unsafe, I think it's worth designing the interface to avoid
> introducing new security problems.

I am very tired of this discussion, you repeat the same arguments over
and over.

You can not prevent DOS on a machine if you allow a process to RO map
your critical files (where you put futexes), because you allow this
process to interfere with critical cache lines bouncing between cpus.

Really, please forget about this crazy idea of allowing foreigners to
_read_ or memory _map_ your files. Dont do it.


2011-06-09 04:11:04

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Wed, Jun 8, 2011 at 11:54 PM, Eric Dumazet <[email protected]> wrote:
>
> You can not prevent DOS on a machine if you allow a process to RO map
> your critical files (where you put futexes), because you allow this
> process to interfere with critical cache lines bouncing between cpus.

The cacheline bounce DoS slows things down and they go back to normal
when you kill the DoS-ing task.

The wakeup-eating DoS is permanent. Seems a good deal worse to me.

If you make this change, please at least document it in the man page.

>
> Really, please forget about this crazy idea of allowing foreigners to
> _read_ or memory _map_ your files. Dont do it.
>

Then how am I supposed to efficiently broadcast information to
untrusted processes? I'll have to put any futexes involved into
different files, but one way or another the actual data will have to
be memory mapped to avoid syscall overhead.

--Andy

2011-06-09 04:43:57

by George Spelvin

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

> You can not prevent DOS on a machine if you allow a process to RO map
> your critical files (where you put futexes), because you allow this
> process to interfere with critical cache lines bouncing between cpus.
>
> Really, please forget about this crazy idea of allowing foreigners to
> _read_ or memory _map_ your files. Don't do it.

Hold on. There's a significant difference between making a system
run slowly (there are a hundred ways to achieve that, starting with
the classic fork bomb), and a permanent hard lockup caused by losing
a FUTEX_WAIT.

Among other things, you can notice a DoS after a while and kill it.
Things should then return to normal.

There may well be a reason to share a file containing futexes read-only.

Just for example, consider a circular buffer that a few (trusted)
processes can write to, but many (less trusted) can read. Obviously,
being able to sleep on the head pointer is useful.

We are already dealing with one problem caused by a too-narrow idea of
how futexes would be used. Let's not add another "nobody would ever
want to do that!" unnecessarily.

2011-06-09 04:45:12

by Kyle Moffett

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Wed, Jun 8, 2011 at 23:54, Eric Dumazet <[email protected]> wrote:
> Le mercredi 08 juin 2011 à 23:38 -0400, Andrew Lutomirski a écrit :
>> Huh?
>>
>> I still don't understand why userspace ought to need to deny read
>> access to a file to prevent DoS.  I think it's entirely reasonable for
>> userspace to make the assumption that users with read access cannot
>> make changes visible to writers unless explicitly documented (i.e.
>> file locking, which is so thoroughly broken that it shouldn't be taken
>> as an example of how to design anything).
>>
>> Given that current kernels make this use safe and the proposal is to
>> make it unsafe, I think it's worth designing the interface to avoid
>> introducing new security problems.
>
> I am very tired of this discussion, you repeat the same arguments over
> and over.
>
> You can not prevent DOS on a machine if you allow a process to RO map
> your critical files (where you put futexes), because you allow this
> process to interfere with critical cache lines bouncing between cpus.
>
> Really, please forget about this crazy idea of allowing foreigners to
> _read_ or memory _map_ your files. Dont do it.

The issue is NOT that things get "slow". There are lots of ways to do that
in an untrusted process on a normal Linux system. Chewing CPU time
and reading random small files from all over the disk are the easiest ones,
and most Linux distributions usually ship lots of such files in directories
such as /usr/share/doc, /usr/share/zoneinfo, various locales directories, etc.

The issue is that this allows you to eat wakeups and make processes hang.

One relatively trivial example would be a database library like libdb or
similar. The library could very reasonably use futexes to communicate
between multiple simultaneous threads writing to the same database
file. Since the library wants to be well-behaved and avoid thundering
herd problems, it only issues a single wakeup for each lock release.

Now you have another program which uses the same database library
to do lockless queries of the of the DB file. This is all well and good
except that it can now permanently hang an unlimited number of writer
threads in FUTEX_WAIT with trivial effort and virtually zero CPU.

All the attacker process needs to do is mmap() one page containing
one lock that the victim threads take occasionally and do this in a loop:
int *victimfutex = [...];
while(1)
futex(victimfutex, *victimfutex, FUTEX_WAIT, NULL, NULL, 0);

Suddenly read-only access to *ANY* database file that happens to
use an in-file futex means that you can hang the database... period.
If you write it in ASM, you could even probably start a whole bunch
of threads in parallel by sharing the same stack.

Even better from a DoS standpoint, this does not trigger any resource
limits the way other attacks would, because you are sleeping 99.999%
of the time and are using no memory. On top of that, once all of the
program's threads are stuck you can exit and it will just stay stuck.

This kind of thing is incredibly common in web-applications and other
similar environments, where "www-data" should be allowed to query
various file databases which are maintained by another daemon.

If the C library happens to use an in-file futex for arbitrating processes
writing to /var/log/utmp or /var/log/lastlog, it suddenly becomes trivial
to lock up every login process.

That is why FUTEX_WAIT needs separate handling for read-only files.

Cheers,
Kyle Moffett

2011-06-09 05:11:27

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le jeudi 09 juin 2011 à 00:10 -0400, Andrew Lutomirski a écrit :
> On Wed, Jun 8, 2011 at 11:54 PM, Eric Dumazet <[email protected]> wrote:
> >
> > You can not prevent DOS on a machine if you allow a process to RO map
> > your critical files (where you put futexes), because you allow this
> > process to interfere with critical cache lines bouncing between cpus.
>
> The cacheline bounce DoS slows things down and they go back to normal
> when you kill the DoS-ing task.
>
> The wakeup-eating DoS is permanent. Seems a good deal worse to me.
>
> If you make this change, please at least document it in the man page.
>


This is how futexes had working for years.

It was very obvious from the beginning. Please submit a man page change
since you raised the point. You own the credit to open a CVE and
immediately release a fix to all 2.6 versions !

How come a critical fix (according to you) went without being noticed
and documented ?

> Then how am I supposed to efficiently broadcast information to
> untrusted processes? I'll have to put any futexes involved into
> different files, but one way or another the actual data will have to
> be memory mapped to avoid syscall overhead.

futexes are a linux extension over standard VM games.

If you dont know how to share a memory segment between a group of
processes, disallowing others to come spy on you, maybe its better to
use another IPC ?

Instead of 'fixing' futexes, what about educating people how to
correctly use memory segments ?


2011-06-09 05:25:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Le jeudi 09 juin 2011 à 00:43 -0400, George Spelvin a écrit :

> Just for example, consider a circular buffer that a few (trusted)
> processes can write to, but many (less trusted) can read. Obviously,
> being able to sleep on the head pointer is useful.


If its useful, then it needs a futex extension (and this must be
emulated on old kernels without this extension)

Remember futex_wake() call would just have to wakeup _all_ threads
instead of one, if kernel lacks this function.

If you dont trust futex_wait() users, just use futex_wake(ALL)

Its should not be a consequence of a previous (unrelated) patch, it
should be an added functionality, dully documented.


2011-06-09 11:37:43

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Hi

I'm sorry for the long delay.

>> Having a new call is inelegant from a futex(2) user perspective, as
>> the need for a change is due to the kernel implementation and/or mutex
>> requirements. The futex() system call, as documented, is ideal for a
>> single producer to signal multiple receivers of state updates.
>>
>> If it is truly necessary to add new variants to futex() to protect
>> applications that allow untrusted applications read access to their
>> mutexes, I would avoid both the names suggested, as consumption of
>> wakeups is not an obvious issue to users, and POLL suggests waiting
>> for multiple entities as in poll(2) (which is not provided), or
>> returning immediately (which is orthogonally provided by the timeout
>> parameter). What is being provided from the user point of view is a
>> FUTEX_WAIT per the man page, which doesn't require write access. How
>> about FUTEX_WAIT_RDONLY?
>>
>> Alternatively, use the current call and document that when process
>> performing a FUTEX_WAIT on read-only memory are woken, they do not
>> count towards the number reported as being woken.
>>
>> Best, IMHO, would be to document that providing read access to mutexes
>> to untrusted software is unsafe behavior, and restore read only access
>> to readers of futexes.
>
> I'm inclined to agree with this approach.

Honestly says, I didn't thought RO mapping users are there in real world.
I'm sorry. I did hope to fix COW issue, not intended to break David Oliver's
workload.

And, I now also recognized this is hard to fix issue. Any fix might screw up
Perter's long time performance improvement effort. Hmm....

Personally, I also incline to agree with FUTEX_WAIT_RDONLY approach. But I
also hope to remove David's head pain. I now have big dilemma. Could you
please give me a chance to show my fixing idea once? If anybody dislike
my idea, I'll drop my opinion quickly.

Today, I've made one concept proof patch. The idea is

1) check the pte of the target address is RW attribute.
2-1) if yes, it has no COW chance. we can safely use gup result.
2-2) if no, we have to fallback slow vma walk. maybe, it's okey.
both RO mappings and COW are rare situation.

I'm not futex expert and I really hope and have to get Darren't review.
I hope to hear somebody's comments.

0001-Revert-futexes-Remove-rw-parameter-from-get_futex_ke.patch
0002-make-slowpath.patch
Alternative fixing idea. Just RFC.

0001-FUTEX_WAIT-read-only.patch
testcase for Darren't futextest git tree.

performance.txt
performance result of Darren't performance test in futextest git.
This seems no significant performance loss.


Again, I have no strong opinion. If anyone put objection, I'll drop this
proposal immediately.


Thanks.




Attachments:
0001-FUTEX_WAIT-read-only.patch (4.85 kB)
0001-Revert-futexes-Remove-rw-parameter-from-get_futex_ke.patch (4.36 kB)
0002-make-slowpath.patch (2.61 kB)
performance.txt (4.21 kB)
Download all attachments

2011-06-09 12:05:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Mon, 2011-06-06 at 20:11 +0200, Eric Dumazet wrote:

> Now, what if other software uses a MAP_PRIVATE ro mapping ?
>
> It was working in previous kernels as well.
>
> We can say its stupid, but IMHO its not.
>
> In other words, this program should work, if process never touches
> (writes) into first page.
>
> This program on previous kernels gave :
> rc=-1 errno=11
> (allowing to wait for a value change and a futex_WAKE)
>
> With new kernel :
> rc=-1 errno=14 [ no sleep allowed ]

Yeah, so friggin what? I can write a similar program that is broken with
the old behaviour. The fact that the semantics are just crap (wtf does
it mean to FUTEX_WAIT on a address that changes mapping on COW) should
be enough to convince you that the old behaviour simply doesn't make
sense.

Once the COW happened you shouldn't be able to observe the old page, yet
that's exactly what the still pending FUTEX_WAIT will do.

The fact that this can happen, means the behaviour is broken, no if else
or buts about it.

2011-06-09 12:05:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:
> 1) check the pte of the target address is RW attribute.
> 2-1) if yes, it has no COW chance. we can safely use gup result.
> 2-2) if no, we have to fallback slow vma walk. maybe, it's okey.
> both RO mappings and COW are rare situation.

Not so, clean file pages are RO, even if the vma is writable. We use the
fault to track dirty state with.

2011-06-09 12:13:20

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Thu, Jun 9, 2011 at 1:11 AM, Eric Dumazet <[email protected]> wrote:
> Le jeudi 09 juin 2011 ? 00:10 -0400, Andrew Lutomirski a ?crit :
>> On Wed, Jun 8, 2011 at 11:54 PM, Eric Dumazet <[email protected]> wrote:
>> >
>> > You can not prevent DOS on a machine if you allow a process to RO map
>> > your critical files (where you put futexes), because you allow this
>> > process to interfere with critical cache lines bouncing between cpus.
>>
>> The cacheline bounce DoS slows things down and they go back to normal
>> when you kill the DoS-ing task.
>>
>> The wakeup-eating DoS is permanent. ?Seems a good deal worse to me.
>>
>> If you make this change, please at least document it in the man page.
>>
>
>
> This is how futexes had working for years.
>
> It was very obvious from the beginning. Please submit a man page change
> since you raised the point. You own the credit to open a CVE and
> immediately release a fix to all 2.6 versions !
>
> How come a critical fix (according to you) went without being noticed
> and documented ?
>

<cynical answer>Because Linux system calls aren't really documented.

It's reasonable to ask people to read a specification of how a system
call works to see if it's well-thought-out, usable, and has good
security properties.

It's also reasonable to ask people to read an implementation of a
system call and check to see that it conforms to the spec.

It's IMO a little less reasonable to ask people to review complicated
mm code to see if the interface it implements is well-designed.

The futex(2) and futex(7) manpages are very incomplete, even
today.</cynical answer>

>
> If its useful, then it needs a futex extension (and this must be
> emulated on old kernels without this extension)

I'm arguing for the extension. I don't think the kernel has any
obligation to make sure that new use-cases are possible on old
versions, though.

--Andy

2011-06-09 17:58:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Thu, 2011-06-09 at 14:05 +0200, Peter Zijlstra wrote:
> On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:
> > 1) check the pte of the target address is RW attribute.
> > 2-1) if yes, it has no COW chance. we can safely use gup result.
> > 2-2) if no, we have to fallback slow vma walk. maybe, it's okey.
> > both RO mappings and COW are rare situation.
>
> Not so, clean file pages are RO, even if the vma is writable. We use the
> fault to track dirty state with.

Why can't something like
http://marc.info/?l=linux-kernel&m=130737669810421 work once you restore
the rw argument?

2011-06-10 03:27:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

(2011/06/09 21:05), Peter Zijlstra wrote:
> On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:
>> 1) check the pte of the target address is RW attribute.
>> 2-1) if yes, it has no COW chance. we can safely use gup result.
>> 2-2) if no, we have to fallback slow vma walk. maybe, it's okey.
>> both RO mappings and COW are rare situation.
>
> Not so, clean file pages are RO, even if the vma is writable. We use the
> fault to track dirty state with.

Right you are. Sigh. Please forget my previous mail.

2011-06-10 03:31:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

(2011/06/10 2:58), Peter Zijlstra wrote:
> On Thu, 2011-06-09 at 14:05 +0200, Peter Zijlstra wrote:
>> On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:
>>> 1) check the pte of the target address is RW attribute.
>>> 2-1) if yes, it has no COW chance. we can safely use gup result.
>>> 2-2) if no, we have to fallback slow vma walk. maybe, it's okey.
>>> both RO mappings and COW are rare situation.
>>
>> Not so, clean file pages are RO, even if the vma is writable. We use the
>> fault to track dirty state with.
>
> Why can't something like
> http://marc.info/?l=linux-kernel&m=130737669810421 work once you restore
> the rw argument?

Yeah, I agree this is best way. Private RO mappings is not sane and no real
world users.

2011-06-10 12:10:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

>> Urgh,. maybe something like the below but with more conditionals that
>> enable the extra logic only for FUTEX_WAIT..
>>
>> The idea is to try a RO gup() when the RW gup() fails so as not to slow
>> down the common path of writable anonymous maps and bail when we used
>> the RO path on anonymous memory.
>>
>> ---
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index fe28dc2..11f2ad1 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>> unsigned long address = (unsigned long)uaddr;
>> struct mm_struct *mm = current->mm;
>> struct page *page, *page_head;
>> - int err;
>> + int err, ro = 0;
>>
>> /*
>> * The futex address must be "naturally" aligned.
>> @@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>>
>> again:
>> err = get_user_pages_fast(address, 1, 1, &page);
>> + if (err == -EFAULT) {
>> + err = get_user_pages_fast(address, 1, 0, &page);
>> + ro = 1;
>> + }
>> if (err < 0)
>> return err;
>>
>> @@ -316,6 +320,11 @@ again:
>> * the object not the particular process.
>> */
>> if (PageAnon(page_head)) {
>> + if (ro) {
>> + err = -EFAULT;
>> + goto out;
>> + }
>> +
>> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
>> key->private.mm = mm;
>> key->private.address = address;
>> @@ -327,9 +336,10 @@ again:
>>
>> get_futex_key_refs(key);
>>

Need err=0 here. (note: get_user_pages_fast() return 1) Other than that looks
good to me and this patch passed my test.
Reviewed-and-tested-by: KOSAKI Motohiro <[email protected]>

>> +out:
>> unlock_page(page_head);
>> put_page(page_head);
>> - return 0;
>> + return err;
>> }
>>
>> static inline void put_futex_key(union futex_key *key)
>>
>
> Hmm, wouldn't that still be susceptible to the zero-page thing if: we
> create a writable private file map of a sparse file, touch a page and
> then remap the thing RO?

After while thinking, I've conclude this is ok. Because 1) as Andrew and
Kyle described, RO mapping usage is not so sane. We need to care it for
only compatibility. 2) David Oliver's case is real compatibility issue.
but I doubt such mprotect() vs futex() race is happen on real world.
3) Anyway, overkill compatibility care might make code slower perhaps.

Off topic: current futex documentations are near terribly unclear and
many futex op are completely undocumented. They are one of root cause
that every change can make compatibility issue. (;_;

2011-06-10 17:29:14

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/10/2011 05:10 AM, KOSAKI Motohiro wrote:

>
> Off topic: current futex documentations are near terribly unclear and
> many futex op are completely undocumented. They are one of root cause
> that every change can make compatibility issue. (;_;

What documentation are you referring to? The futex man page is a wreck,
and I'm not sure what to do with it since glibc removed the futex()
call. You now have to wrap the syscall manually anyway.

If you are referring to the futex.c file itself, I have been documenting
functions as I modify them. If you found any of those lacking, please
let me know which ones and I'll try to clean them up. If you're
referring to those that remain undocumented, please send a doc patch and
I'll review and help get it upstream. I'd like to see this improved as well.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-13 02:11:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

>> Off topic: current futex documentations are near terribly unclear and
>> many futex op are completely undocumented. They are one of root cause
>> that every change can make compatibility issue. (;_;
>
> What documentation are you referring to? The futex man page is a wreck,
> and I'm not sure what to do with it since glibc removed the futex()
> call. You now have to wrap the syscall manually anyway.

Honestly, I don't know linux man pages policy at all. example, gettid(2)
also need to be wrap syscall manually. and it's documented and NOTES section
describe "Glibc does not provide a wrapper for this system call; call it
using syscall(2)".

Or, if nobody want to update the doc, shouldn't we just remove futex(2) man
pages? out date docs are often wrong than nothing. I dunno.


> If you are referring to the futex.c file itself, I have been documenting
> functions as I modify them. If you found any of those lacking, please
> let me know which ones and I'll try to clean them up. If you're
> referring to those that remain undocumented, please send a doc patch and
> I'll review and help get it upstream. I'd like to see this improved as well.

No. I think the comments of futex.c are very good, at least, than a lot of mm code. ;)

2011-06-13 15:50:37

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/12/2011 07:11 PM, KOSAKI Motohiro wrote:
>>> Off topic: current futex documentations are near terribly unclear and
>>> many futex op are completely undocumented. They are one of root cause
>>> that every change can make compatibility issue. (;_;
>>
>> What documentation are you referring to? The futex man page is a wreck,
>> and I'm not sure what to do with it since glibc removed the futex()
>> call. You now have to wrap the syscall manually anyway.
>
> Honestly, I don't know linux man pages policy at all. example, gettid(2)
> also need to be wrap syscall manually. and it's documented and NOTES section
> describe "Glibc does not provide a wrapper for this system call; call it
> using syscall(2)".

Something like that would be good. I don't have any experience pushing
man-page updates. This page seems to document the process:

http://www.kernel.org/doc/man-pages/

>
> Or, if nobody want to update the doc, shouldn't we just remove futex(2) man
> pages? out date docs are often wrong than nothing. I dunno.

I'd prefer to see it updated as I know of several users of the interface
outside of glibc.

>
>
>> If you are referring to the futex.c file itself, I have been documenting
>> functions as I modify them. If you found any of those lacking, please
>> let me know which ones and I'll try to clean them up. If you're
>> referring to those that remain undocumented, please send a doc patch and
>> I'll review and help get it upstream. I'd like to see this improved as well.
>
> No. I think the comments of futex.c are very good, at least, than a lot of mm code. ;)
>
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-15 18:51:08

by Shawn Bohrer

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Fri, Jun 10, 2011 at 09:10:03PM +0900, KOSAKI Motohiro wrote:
> >> Urgh,. maybe something like the below but with more conditionals that
> >> enable the extra logic only for FUTEX_WAIT..
> >>
> >> The idea is to try a RO gup() when the RW gup() fails so as not to slow
> >> down the common path of writable anonymous maps and bail when we used
> >> the RO path on anonymous memory.
> >>
> >> ---
> >> diff --git a/kernel/futex.c b/kernel/futex.c
> >> index fe28dc2..11f2ad1 100644
> >> --- a/kernel/futex.c
> >> +++ b/kernel/futex.c
> >> @@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >> unsigned long address = (unsigned long)uaddr;
> >> struct mm_struct *mm = current->mm;
> >> struct page *page, *page_head;
> >> - int err;
> >> + int err, ro = 0;
> >>
> >> /*
> >> * The futex address must be "naturally" aligned.
> >> @@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >>
> >> again:
> >> err = get_user_pages_fast(address, 1, 1, &page);
> >> + if (err == -EFAULT) {
> >> + err = get_user_pages_fast(address, 1, 0, &page);
> >> + ro = 1;
> >> + }
> >> if (err < 0)
> >> return err;
> >>
> >> @@ -316,6 +320,11 @@ again:
> >> * the object not the particular process.
> >> */
> >> if (PageAnon(page_head)) {
> >> + if (ro) {
> >> + err = -EFAULT;
> >> + goto out;
> >> + }
> >> +
> >> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> >> key->private.mm = mm;
> >> key->private.address = address;
> >> @@ -327,9 +336,10 @@ again:
> >>
> >> get_futex_key_refs(key);
> >>
>
> Need err=0 here. (note: get_user_pages_fast() return 1) Other than that looks
> good to me and this patch passed my test.
> Reviewed-and-tested-by: KOSAKI Motohiro <[email protected]>
>
> >> +out:
> >> unlock_page(page_head);
> >> put_page(page_head);
> >> - return 0;
> >> + return err;
> >> }
> >>
> >> static inline void put_futex_key(union futex_key *key)
> >>

I've reviewed and tested Peter's change with KOSAKI's addition against
2.6.32.41 and this passes our tests using FUTEX_WAIT with read only
shared mappings.

Reviewed-and-tested-by: Shawn Bohrer <[email protected]>

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-15 18:54:49

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.



On 06/15/2011 11:50 AM, Shawn Bohrer wrote:
> On Fri, Jun 10, 2011 at 09:10:03PM +0900, KOSAKI Motohiro wrote:
>>>> Urgh,. maybe something like the below but with more conditionals that
>>>> enable the extra logic only for FUTEX_WAIT..
>>>>
>>>> The idea is to try a RO gup() when the RW gup() fails so as not to slow
>>>> down the common path of writable anonymous maps and bail when we used
>>>> the RO path on anonymous memory.
>>>>
>>>> ---
>>>> diff --git a/kernel/futex.c b/kernel/futex.c
>>>> index fe28dc2..11f2ad1 100644
>>>> --- a/kernel/futex.c
>>>> +++ b/kernel/futex.c
>>>> @@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>>>> unsigned long address = (unsigned long)uaddr;
>>>> struct mm_struct *mm = current->mm;
>>>> struct page *page, *page_head;
>>>> - int err;
>>>> + int err, ro = 0;
>>>>
>>>> /*
>>>> * The futex address must be "naturally" aligned.
>>>> @@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>>>>
>>>> again:
>>>> err = get_user_pages_fast(address, 1, 1, &page);
>>>> + if (err == -EFAULT) {
>>>> + err = get_user_pages_fast(address, 1, 0, &page);
>>>> + ro = 1;
>>>> + }
>>>> if (err < 0)
>>>> return err;
>>>>
>>>> @@ -316,6 +320,11 @@ again:
>>>> * the object not the particular process.
>>>> */
>>>> if (PageAnon(page_head)) {
>>>> + if (ro) {
>>>> + err = -EFAULT;
>>>> + goto out;
>>>> + }
>>>> +
>>>> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
>>>> key->private.mm = mm;
>>>> key->private.address = address;
>>>> @@ -327,9 +336,10 @@ again:
>>>>
>>>> get_futex_key_refs(key);
>>>>
>>
>> Need err=0 here. (note: get_user_pages_fast() return 1) Other than that looks
>> good to me and this patch passed my test.
>> Reviewed-and-tested-by: KOSAKI Motohiro <[email protected]>
>>
>>>> +out:
>>>> unlock_page(page_head);
>>>> put_page(page_head);
>>>> - return 0;
>>>> + return err;
>>>> }
>>>>
>>>> static inline void put_futex_key(union futex_key *key)
>>>>
>
> I've reviewed and tested Peter's change with KOSAKI's addition against
> 2.6.32.41 and this passes our tests using FUTEX_WAIT with read only
> shared mappings.
>
> Reviewed-and-tested-by: Shawn Bohrer <[email protected]>


Would someone care to roll this all together and send a patch with
commit log clearly documenting the issues and which are addressed with
the patch?

Kosaki, do you have updated futextest patches or should I look at the
ones you send previously?

--
Darren


>
> --
> Shawn
>
>
> ---------------------------------------------------------------
> This email, along with any attachments, is confidential. If you
> believe you received this message in error, please contact the
> sender immediately and delete all copies of the message.
> Thank you.
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-17 13:40:39

by Shawn Bohrer

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

On Wed, Jun 15, 2011 at 11:54:46AM -0700, Darren Hart wrote:
>
>
> On 06/15/2011 11:50 AM, Shawn Bohrer wrote:
> > On Fri, Jun 10, 2011 at 09:10:03PM +0900, KOSAKI Motohiro wrote:
> >>>> Urgh,. maybe something like the below but with more conditionals that
> >>>> enable the extra logic only for FUTEX_WAIT..
> >>>>
> >>>> The idea is to try a RO gup() when the RW gup() fails so as not to slow
> >>>> down the common path of writable anonymous maps and bail when we used
> >>>> the RO path on anonymous memory.
> >>>>
> >>>> ---
> >>>> diff --git a/kernel/futex.c b/kernel/futex.c
> >>>> index fe28dc2..11f2ad1 100644
> >>>> --- a/kernel/futex.c
> >>>> +++ b/kernel/futex.c
> >>>> @@ -234,7 +234,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >>>> unsigned long address = (unsigned long)uaddr;
> >>>> struct mm_struct *mm = current->mm;
> >>>> struct page *page, *page_head;
> >>>> - int err;
> >>>> + int err, ro = 0;
> >>>>
> >>>> /*
> >>>> * The futex address must be "naturally" aligned.
> >>>> @@ -262,6 +262,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >>>>
> >>>> again:
> >>>> err = get_user_pages_fast(address, 1, 1, &page);
> >>>> + if (err == -EFAULT) {
> >>>> + err = get_user_pages_fast(address, 1, 0, &page);
> >>>> + ro = 1;
> >>>> + }
> >>>> if (err < 0)
> >>>> return err;
> >>>>
> >>>> @@ -316,6 +320,11 @@ again:
> >>>> * the object not the particular process.
> >>>> */
> >>>> if (PageAnon(page_head)) {
> >>>> + if (ro) {
> >>>> + err = -EFAULT;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> >>>> key->private.mm = mm;
> >>>> key->private.address = address;
> >>>> @@ -327,9 +336,10 @@ again:
> >>>>
> >>>> get_futex_key_refs(key);
> >>>>
> >>
> >> Need err=0 here. (note: get_user_pages_fast() return 1) Other than that looks
> >> good to me and this patch passed my test.
> >> Reviewed-and-tested-by: KOSAKI Motohiro <[email protected]>
> >>
> >>>> +out:
> >>>> unlock_page(page_head);
> >>>> put_page(page_head);
> >>>> - return 0;
> >>>> + return err;
> >>>> }
> >>>>
> >>>> static inline void put_futex_key(union futex_key *key)
> >>>>
> >
> > I've reviewed and tested Peter's change with KOSAKI's addition against
> > 2.6.32.41 and this passes our tests using FUTEX_WAIT with read only
> > shared mappings.
> >
> > Reviewed-and-tested-by: Shawn Bohrer <[email protected]>
>
>
> Would someone care to roll this all together and send a patch with
> commit log clearly documenting the issues and which are addressed with
> the patch?

Since it doesn't appear that anyone is jumping on this opportunity I
can do it.

Peter, is it alright if I add your signed off by on the previous
patch?

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-22 19:19:51

by Shawn Bohrer

[permalink] [raw]
Subject: [PATCH RFC] futex: Fix regression with read only mappings

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 introduced a user-mode
regression in that it additionally prevented futex operations on a
region within a read only memory mapped file. For example this breaks
workloads that have one or more reader processes doing a FUTEX_WAIT on a
futex within a read only shared mapping, and a writer processes that has
a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for futex operations that should be valid on
RO mappings by trying a RO get_user_pages_fast() when the RW
get_user_pages_fast() fails so as not to slow down the common path of
writable anonymous maps and bailing when we used the RO path on
anonymous memory.

Patch based on Peter Zijlstra's initial patch with modifications to only
allow RO mappings for futex operations that need VERIFY_READ access.

Signed-off-by: Shawn Bohrer <[email protected]>
---

Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
mapping. Where my tests on 2.6.18 show that this used to wait
indefinitely. Performing a FUTEX_WAIT on a RW private mapping waits
indefinitely in 2.6.18, 3.0.0, and with this patch applied. It is
unclear to me if this is a good thing or a bug.

kernel/futex.c | 38 ++++++++++++++++++++++++++------------
1 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..e8cd532 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,6 +218,8 @@ static void drop_futex_key_refs(union futex_key *key)
* @uaddr: virtual address of the futex
* @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
* @key: address where result is stored.
+ * @rw: mapping needs to be read/write (values: VERIFY_READ,
+ * VERIFY_WRITE)
*
* Returns a negative error code or 0
* The key words are stored in *key on success.
@@ -229,12 +231,12 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
+get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -262,6 +264,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ if (err == -EFAULT && rw == VERIFY_READ) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;

@@ -316,6 +322,11 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +338,11 @@ again:

get_futex_key_refs(key);

+ err = 0;
+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)
@@ -940,7 +953,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
if (unlikely(ret != 0))
goto out;

@@ -986,10 +999,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1243,10 +1256,11 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
+ requeue_pi ? VERIFY_WRITE : VERIFY_READ);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1790,7 +1804,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ);
if (unlikely(ret != 0))
return ret;

@@ -1941,7 +1955,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2060,7 +2074,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2249,7 +2263,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

--
1.6.5.2



---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-22 20:14:40

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH RFC] futex: Fix regression with read only mappings

Hi Shawn,

Thanks for taking this up. Would you share your testcases as well?

On 06/22/2011 12:19 PM, Shawn Bohrer wrote:
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> regression in that it additionally prevented futex operations on a
> region within a read only memory mapped file. For example this breaks
> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> futex within a read only shared mapping, and a writer processes that has
> a writable mapping issuing the FUTEX_WAKE.
>
> This fixes the regression for futex operations that should be valid on
> RO mappings by trying a RO get_user_pages_fast() when the RW
> get_user_pages_fast() fails so as not to slow down the common path of
> writable anonymous maps and bailing when we used the RO path on
> anonymous memory.
>
> Patch based on Peter Zijlstra's initial patch with modifications to only
> allow RO mappings for futex operations that need VERIFY_READ access.
>
> Signed-off-by: Shawn Bohrer <[email protected]>
> ---
>
> Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
> mapping.

I don't see any harm in this.

> Where my tests on 2.6.18 show that this used to wait
> indefinitely. Performing a FUTEX_WAIT on a RW private mapping waits
> indefinitely in 2.6.18, 3.0.0, and with this patch applied. It is
> unclear to me if this is a good thing or a bug.
>
> kernel/futex.c | 38 ++++++++++++++++++++++++++------------
> 1 files changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..e8cd532 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -218,6 +218,8 @@ static void drop_futex_key_refs(union futex_key *key)
> * @uaddr: virtual address of the futex
> * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
> * @key: address where result is stored.
> + * @rw: mapping needs to be read/write (values: VERIFY_READ,
> + * VERIFY_WRITE)
> *


I'm OK with this, but ideally I I'd prefer fshared and rw be replaced
with flags. We already have a FLAGS_SHARED, adding a FLAGS_WRITE would
be simple, and most call sites could be updated to send the bare flags
rather than a logical and with FLAGS_SHARED as they do now.


> * Returns a negative error code or 0
> * The key words are stored in *key on success.
> @@ -229,12 +231,12 @@ static void drop_futex_key_refs(union futex_key *key)
> * lock_page() might sleep, the caller should not hold a spinlock.
> */
> static int
> -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> +get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
> {
> unsigned long address = (unsigned long)uaddr;
> struct mm_struct *mm = current->mm;
> struct page *page, *page_head;
> - int err;
> + int err, ro = 0;
>
> /*
> * The futex address must be "naturally" aligned.
> @@ -262,6 +264,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>
> again:
> err = get_user_pages_fast(address, 1, 1, &page);
> + if (err == -EFAULT && rw == VERIFY_READ) {
> + err = get_user_pages_fast(address, 1, 0, &page);

I verified this is correct .... we ran into a little trouble a while
back mixing up the parameters of get_user_pages_fast. This is correct :)

> + ro = 1;
> + }
> if (err < 0)
> return err;
>
> @@ -316,6 +322,11 @@ again:
> * the object not the particular process.
> */
> if (PageAnon(page_head)) {
> + if (ro) {
> + err = -EFAULT;
> + goto out;
> + }

This if block needs a comment as to why RO anonymous pages trigger a
failure. In fact... I thought you said with this patch FUTEX_WAIT waits
indefinitely on RO private mappings? This test suggests that it
shouldn't. Are you testing this via glibc pthread_mutexes? If so, and if
I remember correctly, glibc loops forever on -EFAULT here, unfortunately.

> +
> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> key->private.mm = mm;
> key->private.address = address;
> @@ -327,9 +338,11 @@ again:
>
> get_futex_key_refs(key);
>
> + err = 0;

Shouldn't this be 0 anyway? If it was non-zero, you would have jumped to
out: earlier, right?

<snip/>

Thanks,

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-23 02:52:02

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH RFC] futex: Fix regression with read only mappings

(2011/06/23 5:14), Darren Hart wrote:
> Hi Shawn,
>
> Thanks for taking this up. Would you share your testcases as well?
>
> On 06/22/2011 12:19 PM, Shawn Bohrer wrote:
>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>> regression in that it additionally prevented futex operations on a
>> region within a read only memory mapped file. For example this breaks
>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>> futex within a read only shared mapping, and a writer processes that has
>> a writable mapping issuing the FUTEX_WAKE.
>>
>> This fixes the regression for futex operations that should be valid on
>> RO mappings by trying a RO get_user_pages_fast() when the RW
>> get_user_pages_fast() fails so as not to slow down the common path of
>> writable anonymous maps and bailing when we used the RO path on
>> anonymous memory.
>>
>> Patch based on Peter Zijlstra's initial patch with modifications to only
>> allow RO mappings for futex operations that need VERIFY_READ access.
>>
>> Signed-off-by: Shawn Bohrer <[email protected]>
>> ---
>>
>> Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
>> mapping.
>
> I don't see any harm in this.

Hi

I have no objection. However I'd like to explain why I decided to
prefer to refuse RO private mappings and used prefault.

private mapping semantics is, to write access makes process private pages
(ie PageAnon=1).

So, this patch introduce following another corner case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.

It's unclear real world issue or not. But I hope everybody realize it.
So, Probably it would be great if you add some comments for this issue.

2.6.18 had an another trick. It used vma walk for avoiding this issue.
and, unfortunately, it's slow.


>
>> Where my tests on 2.6.18 show that this used to wait
>> indefinitely. Performing a FUTEX_WAIT on a RW private mapping waits
>> indefinitely in 2.6.18, 3.0.0, and with this patch applied. It is
>> unclear to me if this is a good thing or a bug.
>>
>> kernel/futex.c | 38 ++++++++++++++++++++++++++------------
>> 1 files changed, 26 insertions(+), 12 deletions(-)

2011-06-23 03:23:31

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

> Kosaki, do you have updated futextest patches or should I look at the
> ones you send previously?

I'm sorry. I missed your mail. I don't have any update. Please look at
privious one.

2011-06-23 03:58:35

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH RFC] futex: Fix regression with read only mappings

On Wed, Jun 22, 2011 at 01:14:37PM -0700, Darren Hart wrote:
> Hi Shawn,
>
> Thanks for taking this up. Would you share your testcases as well?

I ran our proprietary application, and our test suite. Which I
unfortunately can't share. Our code uses FUTEX_WAIT in shared RO
mappings, and FUTEX_WAKE in shared RW mappings directly via syscalls.
We also use pthread mutexes and pthread conditions as synchronization
objects.

I also hacked up a quick little racy test below just to see what
worked and didn't on 2.6.18, 2.6.38, and with this patch. However, I
wouldn't consider our application or this little test a very thorough
testing of all of the futex functionality.

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

void child()
{
int fd, *futex, rc, val = 42;
struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };

fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
write(fd, &val, sizeof(val));
futex = (int *)mmap(0, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RW MAP_SHARED: rc=%d errno=%d\n", rc, errno);
printf("-----\n");

lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RO MAP_SHARED: rc=%d errno=%d\n", rc, errno);
printf("-----\n");

lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RO MAP_PRIVATE: rc=%d errno=%d\n", rc, errno);
printf("-----\n");

lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RO MAP_PRIVATE: rc=%d errno=%d\n", rc, errno);
printf("-----\n");

lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RW MAP_PRIVATE (EXPECT errno=110): rc=%d errno=%d\n", rc, errno);
printf("-----\n");

lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("Test FUTEX_WAIT RO MAP_SHARED (EXPECT errno=110): rc=%d errno=%d\n", rc, errno);
printf("-----\n");

close(fd);
munmap(futex, sizeof(int));
}

void parent()
{
int fd, *futex, rc, val = 41;

sleep(1);
fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
write(fd, &val, sizeof(val));
futex = (int *)mmap(0, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RW MAP_SHARED: rc=%d errno=%d\n", rc, errno);

sleep(1);
lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RO MAP_SHARED: rc=%d errno=%d\n", rc, errno);

sleep(1);
lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RW MAP_SHARED: rc=%d errno=%d\n", rc, errno);

sleep(1);
lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RO MAP_PRIVATE: rc=%d errno=%d\n", rc, errno);

sleep(1);
lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RO MAP_PRIVATE (EXPECT rc=0): rc=%d errno=%d\n", rc, errno);

sleep(2);
lseek(fd, 0, SEEK_SET);
write(fd, &val, sizeof(val));
munmap(futex, sizeof(int));
futex = (int *)mmap(0, sizeof(int), PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAKE, val, 0, 0, 0);
printf("Test FUTEX_WAKE RW MAP_PRIVATE (EXPECT rc=0): rc=%d errno=%d\n", rc, errno);

close(fd);
munmap(futex, sizeof(int));
}


int main(int argc, char *argv[])
{
pid_t pid;
int status;

pid = fork();
if (pid == -1)
perror("fork");
else if (!pid)
child();
else if (pid > 0)
parent();

wait(&status);
return 0;
}


> On 06/22/2011 12:19 PM, Shawn Bohrer wrote:
> > commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> > parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> > regression in that it additionally prevented futex operations on a
> > region within a read only memory mapped file. For example this breaks
> > workloads that have one or more reader processes doing a FUTEX_WAIT on a
> > futex within a read only shared mapping, and a writer processes that has
> > a writable mapping issuing the FUTEX_WAKE.
> >
> > This fixes the regression for futex operations that should be valid on
> > RO mappings by trying a RO get_user_pages_fast() when the RW
> > get_user_pages_fast() fails so as not to slow down the common path of
> > writable anonymous maps and bailing when we used the RO path on
> > anonymous memory.
> >
> > Patch based on Peter Zijlstra's initial patch with modifications to only
> > allow RO mappings for futex operations that need VERIFY_READ access.
> >
> > Signed-off-by: Shawn Bohrer <[email protected]>
> > ---
> >
> > Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
> > mapping.
>
> I don't see any harm in this.
>
> > Where my tests on 2.6.18 show that this used to wait
> > indefinitely. Performing a FUTEX_WAIT on a RW private mapping waits
> > indefinitely in 2.6.18, 3.0.0, and with this patch applied. It is
> > unclear to me if this is a good thing or a bug.
> >
> > kernel/futex.c | 38 ++++++++++++++++++++++++++------------
> > 1 files changed, 26 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fe28dc2..e8cd532 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -218,6 +218,8 @@ static void drop_futex_key_refs(union futex_key *key)
> > * @uaddr: virtual address of the futex
> > * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
> > * @key: address where result is stored.
> > + * @rw: mapping needs to be read/write (values: VERIFY_READ,
> > + * VERIFY_WRITE)
> > *
>
>
> I'm OK with this, but ideally I I'd prefer fshared and rw be replaced
> with flags. We already have a FLAGS_SHARED, adding a FLAGS_WRITE would
> be simple, and most call sites could be updated to send the bare flags
> rather than a logical and with FLAGS_SHARED as they do now.

Sure, I think that is a reasonable request.

> > * Returns a negative error code or 0
> > * The key words are stored in *key on success.
> > @@ -229,12 +231,12 @@ static void drop_futex_key_refs(union futex_key *key)
> > * lock_page() might sleep, the caller should not hold a spinlock.
> > */
> > static int
> > -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> > +get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
> > {
> > unsigned long address = (unsigned long)uaddr;
> > struct mm_struct *mm = current->mm;
> > struct page *page, *page_head;
> > - int err;
> > + int err, ro = 0;
> >
> > /*
> > * The futex address must be "naturally" aligned.
> > @@ -262,6 +264,10 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >
> > again:
> > err = get_user_pages_fast(address, 1, 1, &page);
> > + if (err == -EFAULT && rw == VERIFY_READ) {
> > + err = get_user_pages_fast(address, 1, 0, &page);
>
> I verified this is correct .... we ran into a little trouble a while
> back mixing up the parameters of get_user_pages_fast. This is correct :)
>
> > + ro = 1;
> > + }
> > if (err < 0)
> > return err;
> >
> > @@ -316,6 +322,11 @@ again:
> > * the object not the particular process.
> > */
> > if (PageAnon(page_head)) {
> > + if (ro) {
> > + err = -EFAULT;
> > + goto out;
> > + }
>
> This if block needs a comment as to why RO anonymous pages trigger a
> failure. In fact... I thought you said with this patch FUTEX_WAIT
> waits indefinitely on RO private mappings? This test suggests that
> it shouldn't. Are you testing this via glibc pthread_mutexes? If so,
> and if I remember correctly, glibc loops forever on -EFAULT here,
> unfortunately.

No with this patch FUTEX_WAIT works with RO private mappings. Pre
7485d0d3758 FUTEX_WAIT on RO private mappings would wait indefinitely.
RW still waits indefinitely both pre 7485d0d3758 and with this patch.

However, I have a confession to make. That if block was taken from
Peter's initial patch, and I originally thought it was preventing RO
private mappings. Now I'm not sure what it is doing. Will PageAnon()
ever return true on a RO private mapping? Is the page only in
anonymous memory after a COW has taken place? Sorry this isn't
exactly my area of expertise.

> > +
> > key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> > key->private.mm = mm;
> > key->private.address = address;
> > @@ -327,9 +338,11 @@ again:
> >
> > get_futex_key_refs(key);
> >
> > + err = 0;
>
> Shouldn't this be 0 anyway? If it was non-zero, you would have jumped to
> out: earlier, right?

No, gup() should return the number of pages when successful. Of
course it would be more obvious what was going on here if we cleared
this closer to the call site.

I should be able to send a reworked patch sometime tomorrow.

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-23 15:26:52

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH RFC] futex: Fix regression with read only mappings



On 06/22/2011 07:51 PM, KOSAKI Motohiro wrote:
> (2011/06/23 5:14), Darren Hart wrote:
>> Hi Shawn,
>>
>> Thanks for taking this up. Would you share your testcases as well?
>>
>> On 06/22/2011 12:19 PM, Shawn Bohrer wrote:
>>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>>> regression in that it additionally prevented futex operations on a
>>> region within a read only memory mapped file. For example this breaks
>>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>>> futex within a read only shared mapping, and a writer processes that has
>>> a writable mapping issuing the FUTEX_WAKE.
>>>
>>> This fixes the regression for futex operations that should be valid on
>>> RO mappings by trying a RO get_user_pages_fast() when the RW
>>> get_user_pages_fast() fails so as not to slow down the common path of
>>> writable anonymous maps and bailing when we used the RO path on
>>> anonymous memory.
>>>
>>> Patch based on Peter Zijlstra's initial patch with modifications to only
>>> allow RO mappings for futex operations that need VERIFY_READ access.
>>>
>>> Signed-off-by: Shawn Bohrer <[email protected]>
>>> ---
>>>
>>> Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
>>> mapping.
>>
>> I don't see any harm in this.
>
> Hi
>
> I have no objection. However I'd like to explain why I decided to
> prefer to refuse RO private mappings and used prefault.
>
> private mapping semantics is, to write access makes process private pages
> (ie PageAnon=1).
>
> So, this patch introduce following another corner case.
>
> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> get_futex_key() return inode based key.
> sleep on the key
> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> Thread-B: write memory-region-A.
> COW happen. This process's memory-region-A become related
> to new COWed private (ie PageAnon=1) page.
> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> get_futex_key() return mm based key.
> IOW, we fail to wake up Thread-A.
>
> It's unclear real world issue or not. But I hope everybody realize it.
> So, Probably it would be great if you add some comments for this issue.
>
> 2.6.18 had an another trick. It used vma walk for avoiding this issue.
> and, unfortunately, it's slow.


Let me try and restate, please tell me if I have this correct:

RO file backed private mappings can be converted to a RW mapping between
FUTEX_WAIT and FUTEX_WAKE, resulting in the use of different keys (as
the RO mapping finds an inode key and the RW mapping returns an
anonymous key). The way to fix this is to use the virtual address
instead of the physical address, like shared futexes, but this has to be
done all the time as there is no way to know if the RO mapping will be
changed after a key is looked up, so it would eliminate the gains of the
private futexes.

A RO private mapping doesn't make a lot of sense to me since at least
one thread of the process will have to have write permission in order
for the mechanism to be useful, so the approach taken in the patch
(EFAULT on RO anonymous private mappings) seems reasonable.

Do I have this right?

--
Darren
>
>
>>
>>> Where my tests on 2.6.18 show that this used to wait
>>> indefinitely. Performing a FUTEX_WAIT on a RW private mapping waits
>>> indefinitely in 2.6.18, 3.0.0, and with this patch applied. It is
>>> unclear to me if this is a good thing or a bug.
>>>
>>> kernel/futex.c | 38 ++++++++++++++++++++++++++------------
>>> 1 files changed, 26 insertions(+), 12 deletions(-)
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-23 19:50:01

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH RFC] futex: Fix regression with read only mappings

On Thu, Jun 23, 2011 at 08:26:49AM -0700, Darren Hart wrote:
>
>
> On 06/22/2011 07:51 PM, KOSAKI Motohiro wrote:
> > (2011/06/23 5:14), Darren Hart wrote:
> >> Hi Shawn,
> >>
> >> Thanks for taking this up. Would you share your testcases as well?
> >>
> >> On 06/22/2011 12:19 PM, Shawn Bohrer wrote:
> >>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> >>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> >>> regression in that it additionally prevented futex operations on a
> >>> region within a read only memory mapped file. For example this breaks
> >>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> >>> futex within a read only shared mapping, and a writer processes that has
> >>> a writable mapping issuing the FUTEX_WAKE.
> >>>
> >>> This fixes the regression for futex operations that should be valid on
> >>> RO mappings by trying a RO get_user_pages_fast() when the RW
> >>> get_user_pages_fast() fails so as not to slow down the common path of
> >>> writable anonymous maps and bailing when we used the RO path on
> >>> anonymous memory.
> >>>
> >>> Patch based on Peter Zijlstra's initial patch with modifications to only
> >>> allow RO mappings for futex operations that need VERIFY_READ access.
> >>>
> >>> Signed-off-by: Shawn Bohrer <[email protected]>
> >>> ---
> >>>
> >>> Interestingly this patch also allows doing a FUTEX_WAIT on a RO private
> >>> mapping.
> >>
> >> I don't see any harm in this.
> >
> > Hi
> >
> > I have no objection. However I'd like to explain why I decided to
> > prefer to refuse RO private mappings and used prefault.
> >
> > private mapping semantics is, to write access makes process private pages
> > (ie PageAnon=1).
> >
> > So, this patch introduce following another corner case.
> >
> > Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> > get_futex_key() return inode based key.
> > sleep on the key
> > Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> > Thread-B: write memory-region-A.
> > COW happen. This process's memory-region-A become related
> > to new COWed private (ie PageAnon=1) page.
> > Thread-B: call futex(FUETX_WAKE, memory-region-A).
> > get_futex_key() return mm based key.
> > IOW, we fail to wake up Thread-A.
> >
> > It's unclear real world issue or not. But I hope everybody realize it.
> > So, Probably it would be great if you add some comments for this issue.
> >
> > 2.6.18 had an another trick. It used vma walk for avoiding this issue.
> > and, unfortunately, it's slow.
>
>
> Let me try and restate, please tell me if I have this correct:
>
> RO file backed private mappings can be converted to a RW mapping between
> FUTEX_WAIT and FUTEX_WAKE, resulting in the use of different keys (as
> the RO mapping finds an inode key and the RW mapping returns an
> anonymous key). The way to fix this is to use the virtual address
> instead of the physical address, like shared futexes, but this has to be
> done all the time as there is no way to know if the RO mapping will be
> changed after a key is looked up, so it would eliminate the gains of the
> private futexes.
>
> A RO private mapping doesn't make a lot of sense to me since at least
> one thread of the process will have to have write permission in order
> for the mechanism to be useful,
<snip>

I agree that a RO private mapping doesn't make a lot of sense, even if
another thread/process has write access to the file. My mmap man page
says:

MAP_PRIVATE
Create a private copy-on-write mapping. Stores
to the region do not affect the original file. It
is unspecified whether changes made to the file
after the mmap() call are visible in the mapped
region.

So apparently it is unspecified whether changes to the underlying file
will even be visible in the mapped region. My quick tests so far
indicate that changes are visible, but it wouldn't surprise me if
there are cases where they are not.

> so the approach taken in the patch (EFAULT on RO anonymous private
> mappings) seems reasonable.

That sounds reasonable but the patch I sent doesn't appear to do this.
The patch I sent returns EFAULT on RO anonymous pages, but as KOSAKI
mentioned above (and from my testing) _write access_ makes the pages
of private mappings anonymous. So if the private mapping is marked
read only it keys off the inode.

I actually don't know of a case where you would have a RO anonymous
page. I tried using mmap with PROT_READ and MAP_ANONYMOUS, but I
believe this simply maps the zero page, and that does cause the
FUTEX_WAIT to hang with my patch. Perhaps you can use mprotect to
mark normal anonymous pages RO? And if you have a RO anonymous page
should FUTEX_WAIT return EFAULT?

I'd say a solution that allows RO shared mappings and rejects
RO private mappings would be ideal, but I currently don't see a way to
tell the difference. You could also argue that users of private
mappings should understand the risks.

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-24 16:00:14

by Shawn Bohrer

[permalink] [raw]
Subject: [PATCH v2] futex: Fix regression with read only mappings

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 introduced a user-mode
regression in that it additionally prevented futex operations on a
region within a read only memory mapped file. For example this breaks
workloads that have one or more reader processes doing a FUTEX_WAIT on a
futex within a read only shared mapping, and a writer processes that has
a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for futex operations that should be valid on
RO mappings by trying a RO get_user_pages_fast() when the RW
get_user_pages_fast() fails so as not to slow down the common path of
writable anonymous maps and bailing when we used the RO path on
anonymous memory.

While fixing the regression this patch opens up two possible bad
scenarios as identified by KOSAKI Motohiro:

1) This patch also allows FUTEX_WAIT on RO private mappings which have
the following corner case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.

2) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current futex
code doesn't handle it at all. Then, zero page makes infinite loop
internally.

This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.

Reported-by: David Oliver <[email protected]>
Signed-off-by: Shawn Bohrer <[email protected]>
---

Consolidated fshared and rw into flags. Moved clearing of err closer to
get_user_pages_fast() call. Clairified corner cases that this patch
opens up in the commit log.

kernel/futex.c | 42 ++++++++++++++++++++++++++++--------------
1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..6f828bc 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_WRITE 0x08

/*
* Priority Inheritance state:
@@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
/**
* get_futex_key() - Get parameters which are the keys for a futex
* @uaddr: virtual address of the futex
- * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
+ * @flags: futex flags (FLAGS_SHARED, etc.)
* @key: address where result is stored.
*
* Returns a negative error code or 0
@@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
+get_futex_key(u32 __user *uaddr, int flags, union futex_key *key)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
* Note : We do have to check 'uaddr' is a valid user address,
* but access_ok() should be faster than find_vma()
*/
- if (!fshared) {
+ if (!(flags & FLAGS_SHARED)) {
if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
return -EFAULT;
key->private.mm = mm;
@@ -262,8 +263,14 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ if (err == -EFAULT && !(flags & FLAGS_WRITE)) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;
+ else
+ err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
page_head = page;
@@ -316,6 +323,11 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +339,10 @@ again:

get_futex_key_refs(key);

+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)
@@ -940,7 +953,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags, &key);
if (unlikely(ret != 0))
goto out;

@@ -986,10 +999,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1243,10 +1256,11 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, requeue_pi ? flags | FLAGS_WRITE : flags,
+ &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1790,7 +1804,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
+ ret = get_futex_key(uaddr, flags, &q->key);
if (unlikely(ret != 0))
return ret;

@@ -1941,7 +1955,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &q.key);
if (unlikely(ret != 0))
goto out;

@@ -2060,7 +2074,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &key);
if (unlikely(ret != 0))
goto out;

@@ -2249,7 +2263,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
if (unlikely(ret != 0))
goto out;

--
1.6.5.2



---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-25 00:00:52

by Darren Hart

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Hi Eric,

I'm finally getting time to review this in depth and try to help Shawn
get his fix upstream. Trying to make sure I have all the facets of this
straight in my head... or on paper at least ;-)

On 06/06/2011 11:27 AM, Eric Dumazet wrote:
> Le lundi 06 juin 2011 à 20:23 +0200, Peter Zijlstra a écrit :
>
>>
>> That's really not the point, what do we do when the COW happens during
>> the FUTEX_WAIT? At that point the process vaddr changes mapping and we
>> cannot continue the wait on the old page, since that would expose
>> invisible information, nor can we switch to the new page since we queued
>> on the old page.
>>
>> Therefore we have to force the COW and queue on the private copy, it
>> really is the only semi sane semantic.
>
> The point is we dont necessarly have to COW the page. If you attempt
> this COW, you shoot on user that did not expect to have a COW.
>
> Take this program : COW is not allowed, still this worked on 2.6.18 (it
> waits until another process change the value in file and call
> futex_wait())
>
> Using PROT_READ | PROT_WRITE instead of PROT_READ was OK too.
>
> (If we use PROT_READ | PROT_WRITE, then after your patch, program doesnt
> work anymore since this process gets a private page after your hidden
> COW : It'll wait forever)


As I understand MMAP(2), this is working due to undefined behavior as
Stephen pointed out earlier:

"It is unspecified whether changes made to the file after the mmap()
call are visible in the mapped region."

I don't think we are under any obligation to keep that working.

--
Darren

>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> typedef uint32_t u32; // for futex.h
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
>
> int main(int argc, char *argv[]) {
> int fd, *futex, rc, val = 42;
>
> fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> write(fd, &val, 4);
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-25 00:37:27

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

Hi Shawn,

OK, I've been doing some digging. Earlier I had been confusing
MAP_PRIVATE private mappings with !FLAGS_SHARED private futexes, and
that really complicated things!

For the sake of this discussion we are only referring to FLAGS_SHARED
futexes, otherwise all the keys would be mm based anyway and the inode
vs mm based key wouldn't come into the picture. Duh.

I believe this is about ready. Please see a few comments below with one
remaining concern.

On 06/24/2011 08:59 AM, Shawn Bohrer wrote:
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> regression in that it additionally prevented futex operations on a
> region within a read only memory mapped file. For example this breaks
> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> futex within a read only shared mapping, and a writer processes that has
> a writable mapping issuing the FUTEX_WAKE.
>
> This fixes the regression for futex operations that should be valid on
> RO mappings by trying a RO get_user_pages_fast() when the RW
> get_user_pages_fast() fails so as not to slow down the common path of
> writable anonymous maps and bailing when we used the RO path on
> anonymous memory.

The goal here is to bail with EFAULT when we use a RO MAP_PRIVATE
mapping on file-backed memory correct?

Final concern:
Are shared memory segments (shmget, etc.) considered "file backed?

>
> While fixing the regression this patch opens up two possible bad
> scenarios as identified by KOSAKI Motohiro:
>
> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
> the following corner case.
>
> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> get_futex_key() return inode based key.
> sleep on the key
> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> Thread-B: write memory-region-A.
> COW happen. This process's memory-region-A become related
> to new COWed private (ie PageAnon=1) page.
> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> get_futex_key() return mm based key.
> IOW, we fail to wake up Thread-A.
>
> 2) Current futex code doesn't handle zero page properly.
>
> Read mode get_user_pages() can return zero page, but current futex
> code doesn't handle it at all. Then, zero page makes infinite loop
> internally.
>
> This Patch is based on Peter Zijlstra's initial patch with modifications to
> only allow RO mappings for futex operations that need VERIFY_READ access.
>
> Reported-by: David Oliver <[email protected]>
> Signed-off-by: Shawn Bohrer <[email protected]>
> ---
>
> Consolidated fshared and rw into flags. Moved clearing of err closer to
> get_user_pages_fast() call. Clairified corner cases that this patch
> opens up in the commit log.
>
> kernel/futex.c | 42 ++++++++++++++++++++++++++++--------------
> 1 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..6f828bc 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
> #define FLAGS_SHARED 0x01
> #define FLAGS_CLOCKRT 0x02
> #define FLAGS_HAS_TIMEOUT 0x04
> +#define FLAGS_WRITE 0x08


Looking at the way this is used, we never care if it's a RW mapping, we
care if it is a RO mapping. So all the logic is inverted below, "if
!(flags & FLAGS_WRITE)" instead of "if flags&FLAGS_RO". I think the
latter might be more intuitive - it's a minor thing, but I think it
would make the code easier to read and the logic easier to follow.

I'd suggest FLAGS_RO.

>
> /*
> * Priority Inheritance state:
> @@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
> /**
> * get_futex_key() - Get parameters which are the keys for a futex
> * @uaddr: virtual address of the futex
> - * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
> + * @flags: futex flags (FLAGS_SHARED, etc.)


As we only support SHARED and WRITE, let's go ahead and list them both
here so users know what is valid.

* @ flags: futex flags: FLAGS_SHARED and FLAGS_WRITE are valid.


> * @key: address where result is stored.
> *
> * Returns a negative error code or 0
> @@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
> * lock_page() might sleep, the caller should not hold a spinlock.
> */
> static int
> -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> +get_futex_key(u32 __user *uaddr, int flags, union futex_key *key)


flags type should be "unsigned int" to be consistent. Was there no
compiler warning about converting from uint to int?


> {
> unsigned long address = (unsigned long)uaddr;
> struct mm_struct *mm = current->mm;
> struct page *page, *page_head;
> - int err;
> + int err, ro = 0;
>
> /*
> * The futex address must be "naturally" aligned.
> @@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> * Note : We do have to check 'uaddr' is a valid user address,
> * but access_ok() should be faster than find_vma()
> */
> - if (!fshared) {
> + if (!(flags & FLAGS_SHARED)) {
> if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
> return -EFAULT;
> key->private.mm = mm;
> @@ -262,8 +263,14 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>
> again:
> err = get_user_pages_fast(address, 1, 1, &page);

This bit needs a comment, maybe:

/*
* If write access is not required (eg. FUTEX_WAIT), try
* and get read-only access.
*/
> + if (err == -EFAULT && !(flags & FLAGS_WRITE)) {
> + err = get_user_pages_fast(address, 1, 0, &page);
> + ro = 1;
> + }
> if (err < 0)
> return err;
> + else
> + err = 0;


Better, thanks.


>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> page_head = page;
> @@ -316,6 +323,11 @@ again:
> * the object not the particular process.
> */
> if (PageAnon(page_head)) {

This bit needs a comment too (unless I am the only one to whom this was
non-obvious), maybe:


/*
* A read-only anonymous page implies a COW on a
* MAP_PRIVATE mapping. There is no sane use-case
* for this scenario, return -EFAULT to userspace.
*/
> + if (ro) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> key->private.mm = mm;
> key->private.address = address;
> @@ -327,9 +339,10 @@ again:
>
> get_futex_key_refs(key);
>
> +out:
> unlock_page(page_head);
> put_page(page_head);
> - return 0;
> + return err;
> }
>
> static inline void put_futex_key(union futex_key *key)
> @@ -940,7 +953,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> if (!bitset)
> return -EINVAL;
>
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> + ret = get_futex_key(uaddr, flags, &key);
> if (unlikely(ret != 0))
> goto out;
>
> @@ -986,10 +999,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> int ret, op_ret;
>
> retry:
> - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> + ret = get_futex_key(uaddr1, flags, &key1);
> if (unlikely(ret != 0))
> goto out;
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
> if (unlikely(ret != 0))
> goto out_put_key1;
>
> @@ -1243,10 +1256,11 @@ retry:
> pi_state = NULL;
> }
>
> - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> + ret = get_futex_key(uaddr1, flags, &key1);
> if (unlikely(ret != 0))
> goto out;
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, requeue_pi ? flags | FLAGS_WRITE : flags,
> + &key2);
> if (unlikely(ret != 0))
> goto out_put_key1;
>
> @@ -1790,7 +1804,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> * while the syscall executes.
> */
> retry:
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
> + ret = get_futex_key(uaddr, flags, &q->key);
> if (unlikely(ret != 0))
> return ret;
>
> @@ -1941,7 +1955,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
> }
>
> retry:
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
> + ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &q.key);
> if (unlikely(ret != 0))
> goto out;
>
> @@ -2060,7 +2074,7 @@ retry:
> if ((uval & FUTEX_TID_MASK) != vpid)
> return -EPERM;
>
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> + ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &key);
> if (unlikely(ret != 0))
> goto out;
>
> @@ -2249,7 +2263,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
> debug_rt_mutex_init_waiter(&rt_waiter);
> rt_waiter.task = NULL;
>
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
> if (unlikely(ret != 0))
> goto out;
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-25 15:11:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>> regression in that it additionally prevented futex operations on a
>> region within a read only memory mapped file. ?For example this breaks
>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>> futex within a read only shared mapping, and a writer processes that has
>> a writable mapping issuing the FUTEX_WAKE.
>>
>> This fixes the regression for futex operations that should be valid on
>> RO mappings by trying a RO get_user_pages_fast() when the RW
>> get_user_pages_fast() fails so as not to slow down the common path of
>> writable anonymous maps and bailing when we used the RO path on
>> anonymous memory.
>
> The goal here is to bail with EFAULT when we use a RO MAP_PRIVATE
> mapping on file-backed memory correct?

The problem is, mapping knowledge is contained vma and it's required mma_sem
(ie performance hurt). Therefore we have only three ways, maybe.

1) always take mmap_sem and walk vma, likes 2.6.18. (cons. slow)
2) prefault, likes current linus-tree. (cons don't work RO mappings)
3) combination prefault and mmap_sem. likes my previous post.
http://groups.google.com/group/linux.kernel/msg/c41c819207e519c2?dmode=source

In other words, you hope to care RO private mapping thing, you can't avoid
mmap_sem anymore. unfortunatelly. So, we have to select either cons.

> Final concern:
> Are shared memory segments (shmget, etc.) considered "file backed?

It's another corner case. It similar with shared file mapping, but,

PageSwapBacked(page) => 1
PageAnon(page) => 0

Thanks.

2011-06-27 16:40:28

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Fri, Jun 24, 2011 at 05:37:23PM -0700, Darren Hart wrote:
> Hi Shawn,
>
> OK, I've been doing some digging. Earlier I had been confusing
> MAP_PRIVATE private mappings with !FLAGS_SHARED private futexes, and
> that really complicated things!
>
> For the sake of this discussion we are only referring to FLAGS_SHARED
> futexes, otherwise all the keys would be mm based anyway and the inode
> vs mm based key wouldn't come into the picture. Duh.
>
> I believe this is about ready. Please see a few comments below with one
> remaining concern.
>
> On 06/24/2011 08:59 AM, Shawn Bohrer wrote:
> > commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> > parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> > regression in that it additionally prevented futex operations on a
> > region within a read only memory mapped file. For example this breaks
> > workloads that have one or more reader processes doing a FUTEX_WAIT on a
> > futex within a read only shared mapping, and a writer processes that has
> > a writable mapping issuing the FUTEX_WAKE.
> >
> > This fixes the regression for futex operations that should be valid on
> > RO mappings by trying a RO get_user_pages_fast() when the RW
> > get_user_pages_fast() fails so as not to slow down the common path of
> > writable anonymous maps and bailing when we used the RO path on
> > anonymous memory.
>
> The goal here is to bail with EFAULT when we use a RO MAP_PRIVATE
> mapping on file-backed memory correct?

No, this patch does not do that. See KOSAKI's follow up on ways that
could be done if that is what is desired. Also see some more of my
comments below.

> Final concern:
> Are shared memory segments (shmget, etc.) considered "file backed?
>
> >
> > While fixing the regression this patch opens up two possible bad
> > scenarios as identified by KOSAKI Motohiro:
> >
> > 1) This patch also allows FUTEX_WAIT on RO private mappings which have
> > the following corner case.
> >
> > Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> > get_futex_key() return inode based key.
> > sleep on the key
> > Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> > Thread-B: write memory-region-A.
> > COW happen. This process's memory-region-A become related
> > to new COWed private (ie PageAnon=1) page.
> > Thread-B: call futex(FUETX_WAKE, memory-region-A).
> > get_futex_key() return mm based key.
> > IOW, we fail to wake up Thread-A.
> >
> > 2) Current futex code doesn't handle zero page properly.
> >
> > Read mode get_user_pages() can return zero page, but current futex
> > code doesn't handle it at all. Then, zero page makes infinite loop
> > internally.
> >
> > This Patch is based on Peter Zijlstra's initial patch with modifications to
> > only allow RO mappings for futex operations that need VERIFY_READ access.
> >
> > Reported-by: David Oliver <[email protected]>
> > Signed-off-by: Shawn Bohrer <[email protected]>
> > ---
> >
> > Consolidated fshared and rw into flags. Moved clearing of err closer to
> > get_user_pages_fast() call. Clairified corner cases that this patch
> > opens up in the commit log.
> >
> > kernel/futex.c | 42 ++++++++++++++++++++++++++++--------------
> > 1 files changed, 28 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fe28dc2..6f828bc 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
> > #define FLAGS_SHARED 0x01
> > #define FLAGS_CLOCKRT 0x02
> > #define FLAGS_HAS_TIMEOUT 0x04
> > +#define FLAGS_WRITE 0x08
>
>
> Looking at the way this is used, we never care if it's a RW mapping, we
> care if it is a RO mapping. So all the logic is inverted below, "if
> !(flags & FLAGS_WRITE)" instead of "if flags&FLAGS_RO". I think the
> latter might be more intuitive - it's a minor thing, but I think it
> would make the code easier to read and the logic easier to follow.
>
> I'd suggest FLAGS_RO.

I can do that.

> >
> > /*
> > * Priority Inheritance state:
> > @@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
> > /**
> > * get_futex_key() - Get parameters which are the keys for a futex
> > * @uaddr: virtual address of the futex
> > - * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
> > + * @flags: futex flags (FLAGS_SHARED, etc.)
>
>
> As we only support SHARED and WRITE, let's go ahead and list them both
> here so users know what is valid.
>
> * @ flags: futex flags: FLAGS_SHARED and FLAGS_WRITE are valid.
>

Yep, will update.

> > * @key: address where result is stored.
> > *
> > * Returns a negative error code or 0
> > @@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
> > * lock_page() might sleep, the caller should not hold a spinlock.
> > */
> > static int
> > -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> > +get_futex_key(u32 __user *uaddr, int flags, union futex_key *key)
>
>
> flags type should be "unsigned int" to be consistent. Was there no
> compiler warning about converting from uint to int?
>

Good catch. Gcc 4.4.3 didn't complain.

> > {
> > unsigned long address = (unsigned long)uaddr;
> > struct mm_struct *mm = current->mm;
> > struct page *page, *page_head;
> > - int err;
> > + int err, ro = 0;
> >
> > /*
> > * The futex address must be "naturally" aligned.
> > @@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> > * Note : We do have to check 'uaddr' is a valid user address,
> > * but access_ok() should be faster than find_vma()
> > */
> > - if (!fshared) {
> > + if (!(flags & FLAGS_SHARED)) {
> > if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
> > return -EFAULT;
> > key->private.mm = mm;
> > @@ -262,8 +263,14 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> >
> > again:
> > err = get_user_pages_fast(address, 1, 1, &page);
>
> This bit needs a comment, maybe:
>
> /*
> * If write access is not required (eg. FUTEX_WAIT), try
> * and get read-only access.
> */
> > + if (err == -EFAULT && !(flags & FLAGS_WRITE)) {
> > + err = get_user_pages_fast(address, 1, 0, &page);
> > + ro = 1;
> > + }
> > if (err < 0)
> > return err;
> > + else
> > + err = 0;
>
>
> Better, thanks.
>
>
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > page_head = page;
> > @@ -316,6 +323,11 @@ again:
> > * the object not the particular process.
> > */
> > if (PageAnon(page_head)) {
>
> This bit needs a comment too (unless I am the only one to whom this was
> non-obvious), maybe:
>
>
> /*
> * A read-only anonymous page implies a COW on a
> * MAP_PRIVATE mapping. There is no sane use-case
> * for this scenario, return -EFAULT to userspace.
> */

Your comment is wrong. Unfortunately the code is completly
non-obvious to me as well, and I have no idea why it is there. This
little snippet came from Peter's suggested fix in:

https://lkml.org/lkml/2011/6/6/368

Sadly Peter's gone silent and I'm left wondering if he knew some
corner case that should return -EFAULT with a RO anonymous page or if
he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
the latter then this block can be removed because it does NOT do that.
> > + if (ro) {
> > + err = -EFAULT;
> > + goto out;
> > + }
> > +
> > key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> > key->private.mm = mm;
> > key->private.address = address;
> > @@ -327,9 +339,10 @@ again:
> >
> > get_futex_key_refs(key);
> >
> > +out:
> > unlock_page(page_head);
> > put_page(page_head);
> > - return 0;
> > + return err;
> > }
> >
> > static inline void put_futex_key(union futex_key *key)
> > @@ -940,7 +953,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> > if (!bitset)
> > return -EINVAL;
> >
> > - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> > + ret = get_futex_key(uaddr, flags, &key);
> > if (unlikely(ret != 0))
> > goto out;
> >
> > @@ -986,10 +999,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> > int ret, op_ret;
> >
> > retry:
> > - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> > + ret = get_futex_key(uaddr1, flags, &key1);
> > if (unlikely(ret != 0))
> > goto out;
> > - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> > + ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
> > if (unlikely(ret != 0))
> > goto out_put_key1;
> >
> > @@ -1243,10 +1256,11 @@ retry:
> > pi_state = NULL;
> > }
> >
> > - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> > + ret = get_futex_key(uaddr1, flags, &key1);
> > if (unlikely(ret != 0))
> > goto out;
> > - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> > + ret = get_futex_key(uaddr2, requeue_pi ? flags | FLAGS_WRITE : flags,
> > + &key2);
> > if (unlikely(ret != 0))
> > goto out_put_key1;
> >
> > @@ -1790,7 +1804,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> > * while the syscall executes.
> > */
> > retry:
> > - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
> > + ret = get_futex_key(uaddr, flags, &q->key);
> > if (unlikely(ret != 0))
> > return ret;
> >
> > @@ -1941,7 +1955,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
> > }
> >
> > retry:
> > - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
> > + ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &q.key);
> > if (unlikely(ret != 0))
> > goto out;
> >
> > @@ -2060,7 +2074,7 @@ retry:
> > if ((uval & FUTEX_TID_MASK) != vpid)
> > return -EPERM;
> >
> > - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> > + ret = get_futex_key(uaddr, flags | FLAGS_WRITE, &key);
> > if (unlikely(ret != 0))
> > goto out;
> >
> > @@ -2249,7 +2263,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
> > debug_rt_mutex_init_waiter(&rt_waiter);
> > rt_waiter.task = NULL;
> >
> > - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> > + ret = get_futex_key(uaddr2, flags | FLAGS_WRITE, &key2);
> > if (unlikely(ret != 0))
> > goto out;
> >
>
> --
> Darren Hart
> Intel Open Source Technology Center
> Yocto Project - Linux Kernel


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-27 16:57:05

by Shawn Bohrer

[permalink] [raw]
Subject: Re: Change in functionality of futex() system call.

Hey Eric,

I have one question below.

On Mon, Jun 06, 2011 at 08:27:51PM +0200, Eric Dumazet wrote:
> Le lundi 06 juin 2011 ? 20:23 +0200, Peter Zijlstra a ?crit :
>
> >
> > That's really not the point, what do we do when the COW happens during
> > the FUTEX_WAIT? At that point the process vaddr changes mapping and we
> > cannot continue the wait on the old page, since that would expose
> > invisible information, nor can we switch to the new page since we queued
> > on the old page.
> >
> > Therefore we have to force the COW and queue on the private copy, it
> > really is the only semi sane semantic.
>
> The point is we dont necessarly have to COW the page. If you attempt
> this COW, you shoot on user that did not expect to have a COW.
>
> Take this program : COW is not allowed, still this worked on 2.6.18 (it
> waits until another process change the value in file and call
> futex_wait())
>
> Using PROT_READ | PROT_WRITE instead of PROT_READ was OK too.

Did this actually work on 2.6.18? I'm testing on RHEL 5 and both
PROT_READ, and PROT_READ|PROT_WRITE with MAP_PRIVATE wait forever.
I'm not sure if that is from Red Hat backports or not.

Peter's change on top of 3.0 actually make the PROT_READ MAP_PRIVATE
case work, those as you point out below PROT_READ|PROT_WRITE waits
forever.

> (If we use PROT_READ | PROT_WRITE, then after your patch, program doesnt
> work anymore since this process gets a private page after your hidden
> COW : It'll wait forever)
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> typedef uint32_t u32; // for futex.h
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
>
> int main(int argc, char *argv[]) {
> int fd, *futex, rc, val = 42;
>
> fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> write(fd, &val, 4);
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
>
>


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-27 18:18:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
> > > if (PageAnon(page_head)) {
> >
> > This bit needs a comment too (unless I am the only one to whom this
> was
> > non-obvious), maybe:
> >
> >
> > /*
> > * A read-only anonymous page implies a COW on a
> > * MAP_PRIVATE mapping. There is no sane use-case
> > * for this scenario, return -EFAULT to userspace.
> > */
>
> Your comment is wrong. Unfortunately the code is completly
> non-obvious to me as well, and I have no idea why it is there. This
> little snippet came from Peter's suggested fix in:
>
> https://lkml.org/lkml/2011/6/6/368
>
> Sadly Peter's gone silent and I'm left wondering if he knew some
> corner case that should return -EFAULT with a RO anonymous page or if
> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
> the latter then this block can be removed because it does NOT do that.
> > > + if (ro) {
> > > + err = -EFAULT;
> > > + goto out;
> > > + }
> > > +

Peter simply gets too much email.. anyway, the reason I put that there
is that a RO Anon page will never change and is thus a little pointless
to use for futex ops.

2011-06-27 20:43:08

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings



On 06/27/2011 11:15 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
>>>> if (PageAnon(page_head)) {
>>>
>>> This bit needs a comment too (unless I am the only one to whom this
>> was
>>> non-obvious), maybe:
>>>
>>>
>>> /*
>>> * A read-only anonymous page implies a COW on a
>>> * MAP_PRIVATE mapping. There is no sane use-case
>>> * for this scenario, return -EFAULT to userspace.
>>> */
>>
>> Your comment is wrong. Unfortunately the code is completly
>> non-obvious to me as well, and I have no idea why it is there. This
>> little snippet came from Peter's suggested fix in:
>>
>> https://lkml.org/lkml/2011/6/6/368
>>
>> Sadly Peter's gone silent and I'm left wondering if he knew some
>> corner case that should return -EFAULT with a RO anonymous page or if
>> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
>> the latter then this block can be removed because it does NOT do that.
>>>> + if (ro) {
>>>> + err = -EFAULT;
>>>> + goto out;
>>>> + }
>>>> +
>
> Peter simply gets too much email.. anyway, the reason I put that there
> is that a RO Anon page will never change and is thus a little pointless
> to use for futex ops.
>

Right, and that was the logic I was trying to document. Shawn, how is my
comment above wrong? A read-only anonymous page but itself doesn't imply
a COW, but it does it does in the context of this code from my reading.


--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-27 21:09:17

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Mon, Jun 27, 2011 at 01:41:12PM -0700, Darren Hart wrote:
>
>
> On 06/27/2011 11:15 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
> >>>> if (PageAnon(page_head)) {
> >>>
> >>> This bit needs a comment too (unless I am the only one to whom this
> >> was
> >>> non-obvious), maybe:
> >>>
> >>>
> >>> /*
> >>> * A read-only anonymous page implies a COW on a
> >>> * MAP_PRIVATE mapping. There is no sane use-case
> >>> * for this scenario, return -EFAULT to userspace.
> >>> */
> >>
> >> Your comment is wrong. Unfortunately the code is completly
> >> non-obvious to me as well, and I have no idea why it is there. This
> >> little snippet came from Peter's suggested fix in:
> >>
> >> https://lkml.org/lkml/2011/6/6/368
> >>
> >> Sadly Peter's gone silent and I'm left wondering if he knew some
> >> corner case that should return -EFAULT with a RO anonymous page or if
> >> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
> >> the latter then this block can be removed because it does NOT do that.
> >>>> + if (ro) {
> >>>> + err = -EFAULT;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >
> > Peter simply gets too much email.. anyway, the reason I put that there
> > is that a RO Anon page will never change and is thus a little pointless
> > to use for futex ops.
> >
>
> Right, and that was the logic I was trying to document. Shawn, how is my
> comment above wrong? A read-only anonymous page but itself doesn't imply
> a COW, but it does it does in the context of this code from my reading.

All I can tell you is from my testing is a PROT_READ, MAP_PRIVATE
page isn't an anonymous page. In other words.

futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);

Works just fine with my patch and does NOT return EFAULT. Your
comment indicates the opposite.

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-27 21:41:07

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings



On 06/27/2011 02:08 PM, Shawn Bohrer wrote:
> On Mon, Jun 27, 2011 at 01:41:12PM -0700, Darren Hart wrote:
>>
>>
>> On 06/27/2011 11:15 AM, Peter Zijlstra wrote:
>>> On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
>>>>>> if (PageAnon(page_head)) {
>>>>>
>>>>> This bit needs a comment too (unless I am the only one to whom this
>>>> was
>>>>> non-obvious), maybe:
>>>>>
>>>>>
>>>>> /*
>>>>> * A read-only anonymous page implies a COW on a
>>>>> * MAP_PRIVATE mapping. There is no sane use-case
>>>>> * for this scenario, return -EFAULT to userspace.
>>>>> */
>>>>
>>>> Your comment is wrong. Unfortunately the code is completly
>>>> non-obvious to me as well, and I have no idea why it is there. This
>>>> little snippet came from Peter's suggested fix in:
>>>>
>>>> https://lkml.org/lkml/2011/6/6/368
>>>>
>>>> Sadly Peter's gone silent and I'm left wondering if he knew some
>>>> corner case that should return -EFAULT with a RO anonymous page or if
>>>> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
>>>> the latter then this block can be removed because it does NOT do that.
>>>>>> + if (ro) {
>>>>>> + err = -EFAULT;
>>>>>> + goto out;
>>>>>> + }
>>>>>> +
>>>
>>> Peter simply gets too much email.. anyway, the reason I put that there
>>> is that a RO Anon page will never change and is thus a little pointless
>>> to use for futex ops.
>>>
>>
>> Right, and that was the logic I was trying to document. Shawn, how is my
>> comment above wrong? A read-only anonymous page but itself doesn't imply
>> a COW, but it does it does in the context of this code from my reading.
>
> All I can tell you is from my testing is a PROT_READ, MAP_PRIVATE
> page isn't an anonymous page. In other words.
>
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
>
> Works just fine with my patch and does NOT return EFAULT. Your
> comment indicates the opposite.

I see. That would suggest to me then that the get_user_pages doesn't
force the COW when for read-only access when write access is requested.
This makes sense from a get_user_pages perspective.

Kosaki pointed out that the mapping information is contained in the VMA.
We could test for this only if the RW get_user_pages fails, that would
leave the common case fast, but would hurt the valid RO SHARED case. The
alternative it seems is to just let RW private users hang themselves.

I'm tempted to accept the latter and document it in the futex.c file and
then in the man page.

Thoughts?

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-27 22:15:09

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Mon, Jun 27, 2011 at 02:39:31PM -0700, Darren Hart wrote:
>
>
> On 06/27/2011 02:08 PM, Shawn Bohrer wrote:
> > On Mon, Jun 27, 2011 at 01:41:12PM -0700, Darren Hart wrote:
> >>
> >>
> >> On 06/27/2011 11:15 AM, Peter Zijlstra wrote:
> >>> On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
> >>>>>> if (PageAnon(page_head)) {
> >>>>>
> >>>>> This bit needs a comment too (unless I am the only one to whom this
> >>>> was
> >>>>> non-obvious), maybe:
> >>>>>
> >>>>>
> >>>>> /*
> >>>>> * A read-only anonymous page implies a COW on a
> >>>>> * MAP_PRIVATE mapping. There is no sane use-case
> >>>>> * for this scenario, return -EFAULT to userspace.
> >>>>> */
> >>>>
> >>>> Your comment is wrong. Unfortunately the code is completly
> >>>> non-obvious to me as well, and I have no idea why it is there. This
> >>>> little snippet came from Peter's suggested fix in:
> >>>>
> >>>> https://lkml.org/lkml/2011/6/6/368
> >>>>
> >>>> Sadly Peter's gone silent and I'm left wondering if he knew some
> >>>> corner case that should return -EFAULT with a RO anonymous page or if
> >>>> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
> >>>> the latter then this block can be removed because it does NOT do that.
> >>>>>> + if (ro) {
> >>>>>> + err = -EFAULT;
> >>>>>> + goto out;
> >>>>>> + }
> >>>>>> +
> >>>
> >>> Peter simply gets too much email.. anyway, the reason I put that there
> >>> is that a RO Anon page will never change and is thus a little pointless
> >>> to use for futex ops.
> >>>
> >>
> >> Right, and that was the logic I was trying to document. Shawn, how is my
> >> comment above wrong? A read-only anonymous page but itself doesn't imply
> >> a COW, but it does it does in the context of this code from my reading.
> >
> > All I can tell you is from my testing is a PROT_READ, MAP_PRIVATE
> > page isn't an anonymous page. In other words.
> >
> > futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
> > rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
> >
> > Works just fine with my patch and does NOT return EFAULT. Your
> > comment indicates the opposite.
>
> I see. That would suggest to me then that the get_user_pages doesn't
> force the COW when for read-only access when write access is requested.
> This makes sense from a get_user_pages perspective.
>
> Kosaki pointed out that the mapping information is contained in the VMA.
> We could test for this only if the RW get_user_pages fails, that would
> leave the common case fast, but would hurt the valid RO SHARED case. The
> alternative it seems is to just let RW private users hang themselves.

RW private users? I believe that RW private users always have been
able and still can use a futex within their mapping between threads.
The RW get_user_pages() does force a COW which means that another
process updating the underlying file won't wake up the process, but
once again this seems to have been the behavior on 2.6.18-128.7.1.el5,
2.6.32.41 and probably all other versions.

> I'm tempted to accept the latter and document it in the futex.c file and
> then in the man page.
>
> Thoughts?

So yes I vote for the patch I've been sending with perhaps a futex man
page cleanup. It does open some corner cases for RO private mappings.
You can hang on the zero page, or hang if you change the permissions
on the mapping after the fact with mprotect, but both of those things
seem very obscure to me thus I vote they should simply be documented
(I did add them to the patch description).

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-27 22:24:58

by Shawn Bohrer

[permalink] [raw]
Subject: [PATCH v3] futex: Fix regression with read only mappings

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 introduced a user-mode
regression in that it additionally prevented futex operations on a
region within a read only memory mapped file. For example this breaks
workloads that have one or more reader processes doing a FUTEX_WAIT on a
futex within a read only shared mapping, and a writer processes that has
a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for futex operations that should be valid on
RO mappings by trying a RO get_user_pages_fast() when the RW
get_user_pages_fast() fails so as not to slow down the common path of
writable anonymous maps and bailing when we used the RO path on
anonymous memory.

While fixing the regression this patch opens up two possible bad
scenarios as identified by KOSAKI Motohiro:

1) This patch also allows FUTEX_WAIT on RO private mappings which have
the following corner case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.

2) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current futex
code doesn't handle it at all. Then, zero page makes infinite loop
internally.

This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.

Reported-by: David Oliver <[email protected]>
Signed-off-by: Shawn Bohrer <[email protected]>
---

Changes FLAGS_WRITE to FLAGS_RO. Made flags an unsigned int instead of
int. Added/cleaned up comments.

kernel/futex.c | 50 ++++++++++++++++++++++++++++++++++++--------------
1 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..1737a66 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_RO 0x08

/*
* Priority Inheritance state:
@@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
/**
* get_futex_key() - Get parameters which are the keys for a futex
* @uaddr: virtual address of the futex
- * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
+ * @flags: futex flags: FLAGS_SHARED and FLAGS_RO are valid.
* @key: address where result is stored.
*
* Returns a negative error code or 0
@@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
+get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
* Note : We do have to check 'uaddr' is a valid user address,
* but access_ok() should be faster than find_vma()
*/
- if (!fshared) {
+ if (!(flags & FLAGS_SHARED)) {
if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
return -EFAULT;
key->private.mm = mm;
@@ -262,8 +263,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ /*
+ * If write access is not required (eg. FUTEX_WAIT), try
+ * and get read-only access.
+ */
+ if (err == -EFAULT && flags & FLAGS_RO) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;
+ else
+ err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
page_head = page;
@@ -316,6 +327,15 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ /*
+ * A RO anonymous page will never change and thus doesn't make
+ * sense for futex operations.
+ */
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +347,10 @@ again:

get_futex_key_refs(key);

+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)
@@ -940,7 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags | FLAGS_RO, &key);
if (unlikely(ret != 0))
goto out;

@@ -986,10 +1007,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags, &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1243,10 +1264,11 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, requeue_pi ? flags : flags | FLAGS_RO,
+ &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1790,7 +1812,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
+ ret = get_futex_key(uaddr, flags | FLAGS_RO, &q->key);
if (unlikely(ret != 0))
return ret;

@@ -1941,7 +1963,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ ret = get_futex_key(uaddr, flags, &q.key);
if (unlikely(ret != 0))
goto out;

@@ -2060,7 +2082,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags, &key);
if (unlikely(ret != 0))
goto out;

@@ -2249,7 +2271,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags, &key2);
if (unlikely(ret != 0))
goto out;

--
1.6.5.2



---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-27 23:20:32

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings



On 06/27/2011 03:14 PM, Shawn Bohrer wrote:
> On Mon, Jun 27, 2011 at 02:39:31PM -0700, Darren Hart wrote:
>>
>>
>> On 06/27/2011 02:08 PM, Shawn Bohrer wrote:
>>> On Mon, Jun 27, 2011 at 01:41:12PM -0700, Darren Hart wrote:
>>>>
>>>>
>>>> On 06/27/2011 11:15 AM, Peter Zijlstra wrote:
>>>>> On Mon, 2011-06-27 at 11:40 -0500, Shawn Bohrer wrote:
>>>>>>>> if (PageAnon(page_head)) {
>>>>>>>
>>>>>>> This bit needs a comment too (unless I am the only one to whom this
>>>>>> was
>>>>>>> non-obvious), maybe:
>>>>>>>
>>>>>>>
>>>>>>> /*
>>>>>>> * A read-only anonymous page implies a COW on a
>>>>>>> * MAP_PRIVATE mapping. There is no sane use-case
>>>>>>> * for this scenario, return -EFAULT to userspace.
>>>>>>> */
>>>>>>
>>>>>> Your comment is wrong. Unfortunately the code is completly
>>>>>> non-obvious to me as well, and I have no idea why it is there. This
>>>>>> little snippet came from Peter's suggested fix in:
>>>>>>
>>>>>> https://lkml.org/lkml/2011/6/6/368
>>>>>>
>>>>>> Sadly Peter's gone silent and I'm left wondering if he knew some
>>>>>> corner case that should return -EFAULT with a RO anonymous page or if
>>>>>> he _thought_ this was preventing RO MAP_PRIVATE mappings. If it is
>>>>>> the latter then this block can be removed because it does NOT do that.
>>>>>>>> + if (ro) {
>>>>>>>> + err = -EFAULT;
>>>>>>>> + goto out;
>>>>>>>> + }
>>>>>>>> +
>>>>>
>>>>> Peter simply gets too much email.. anyway, the reason I put that there
>>>>> is that a RO Anon page will never change and is thus a little pointless
>>>>> to use for futex ops.
>>>>>
>>>>
>>>> Right, and that was the logic I was trying to document. Shawn, how is my
>>>> comment above wrong? A read-only anonymous page but itself doesn't imply
>>>> a COW, but it does it does in the context of this code from my reading.
>>>
>>> All I can tell you is from my testing is a PROT_READ, MAP_PRIVATE
>>> page isn't an anonymous page. In other words.
>>>
>>> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE, fd, 0);
>>> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, 0, 0, 0);
>>>
>>> Works just fine with my patch and does NOT return EFAULT. Your
>>> comment indicates the opposite.
>>
>> I see. That would suggest to me then that the get_user_pages doesn't
>> force the COW when for read-only access when write access is requested.
>> This makes sense from a get_user_pages perspective.
>>
>> Kosaki pointed out that the mapping information is contained in the VMA.
>> We could test for this only if the RW get_user_pages fails, that would
>> leave the common case fast, but would hurt the valid RO SHARED case. The
>> alternative it seems is to just let RW private users hang themselves.
>
> RW private users?

Sorry, I meant "RO private users"

I believe that RW private users always have been
> able and still can use a futex within their mapping between threads.
> The RW get_user_pages() does force a COW which means that another
> process updating the underlying file won't wake up the process, but
> once again this seems to have been the behavior on 2.6.18-128.7.1.el5,
> 2.6.32.41 and probably all other versions.
>
>> I'm tempted to accept the latter and document it in the futex.c file and
>> then in the man page.
>>
>> Thoughts?
>
> So yes I vote for the patch I've been sending with perhaps a futex man
> page cleanup. It does open some corner cases for RO private mappings.
> You can hang on the zero page, or hang if you change the permissions
> on the mapping after the fact with mprotect, but both of those things
> seem very obscure to me thus I vote they should simply be documented
> (I did add them to the patch description).


I'm concerned about the zero page. I need to go re-read all of Kosaki's
last round of patches related to the zero page. I'll look at this
tonight or tomorrow, but if I remember correctly, the zero page left a
DOS opportunity - we can't re-introduce something like that.

Thanks,

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-28 10:53:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Mon, 2011-06-27 at 14:39 -0700, Darren Hart wrote:
> The alternative it seems is to just let RW private users hang
> themselves.

Yeah, screw those, that's insane anyway.

2011-06-28 10:56:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings

On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> regression in that it additionally prevented futex operations on a
> region within a read only memory mapped file. For example this breaks
> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> futex within a read only shared mapping, and a writer processes that has
> a writable mapping issuing the FUTEX_WAKE.
>
> This fixes the regression for futex operations that should be valid on
> RO mappings by trying a RO get_user_pages_fast() when the RW
> get_user_pages_fast() fails so as not to slow down the common path of
> writable anonymous maps and bailing when we used the RO path on
> anonymous memory.
>
> While fixing the regression this patch opens up two possible bad
> scenarios as identified by KOSAKI Motohiro:
>
> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
> the following corner case.
>
> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> get_futex_key() return inode based key.
> sleep on the key
> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> Thread-B: write memory-region-A.
> COW happen. This process's memory-region-A become related
> to new COWed private (ie PageAnon=1) page.
> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> get_futex_key() return mm based key.
> IOW, we fail to wake up Thread-A.
>
> 2) Current futex code doesn't handle zero page properly.
>
> Read mode get_user_pages() can return zero page, but current futex
> code doesn't handle it at all. Then, zero page makes infinite loop
> internally.
>
> This Patch is based on Peter Zijlstra's initial patch with modifications to
> only allow RO mappings for futex operations that need VERIFY_READ access.
>
> Reported-by: David Oliver <[email protected]>
> Signed-off-by: Shawn Bohrer <[email protected]>

Acked-by: Peter Zijlstra <[email protected]>

2011-06-28 14:21:23

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings



On 06/28/2011 03:50 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-27 at 14:39 -0700, Darren Hart wrote:
>> The alternative it seems is to just let RW private users hang
>> themselves.
>
> Yeah, screw those, that's insane anyway.

And I meant RO private users (for those reading just this message and
getting really confused ;-)

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-28 14:25:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] futex: Fix regression with read only mappings

On Tue, 2011-06-28 at 07:19 -0700, Darren Hart wrote:
>
> On 06/28/2011 03:50 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-27 at 14:39 -0700, Darren Hart wrote:
> >> The alternative it seems is to just let RW private users hang
> >> themselves.
> >
> > Yeah, screw those, that's insane anyway.
>
> And I meant RO private users (for those reading just this message and
> getting really confused ;-)

RW private is crazy too, there is no guarantee you'll stay shared and
thus relying on the shared part of the semantics is broken. As is
clearly documented in the mmap() man page as pointed out by someone
somewhere in this thread.

2011-06-28 14:53:25

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings



On 06/28/2011 03:54 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>> regression in that it additionally prevented futex operations on a
>> region within a read only memory mapped file. For example this breaks
>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>> futex within a read only shared mapping, and a writer processes that has
>> a writable mapping issuing the FUTEX_WAKE.
>>
>> This fixes the regression for futex operations that should be valid on
>> RO mappings by trying a RO get_user_pages_fast() when the RW
>> get_user_pages_fast() fails so as not to slow down the common path of
>> writable anonymous maps and bailing when we used the RO path on
>> anonymous memory.
>>
>> While fixing the regression this patch opens up two possible bad
>> scenarios as identified by KOSAKI Motohiro:
>>
>> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
>> the following corner case.
>>
>> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
>> get_futex_key() return inode based key.
>> sleep on the key
>> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
>> Thread-B: write memory-region-A.
>> COW happen. This process's memory-region-A become related
>> to new COWed private (ie PageAnon=1) page.
>> Thread-B: call futex(FUETX_WAKE, memory-region-A).
>> get_futex_key() return mm based key.
>> IOW, we fail to wake up Thread-A.
>>
>> 2) Current futex code doesn't handle zero page properly.
>>
>> Read mode get_user_pages() can return zero page, but current futex
>> code doesn't handle it at all. Then, zero page makes infinite loop
>> internally.
>>
>> This Patch is based on Peter Zijlstra's initial patch with modifications to
>> only allow RO mappings for futex operations that need VERIFY_READ access.
>>
>> Reported-by: David Oliver <[email protected]>
>> Signed-off-by: Shawn Bohrer <[email protected]>
>
> Acked-by: Peter Zijlstra <[email protected]>

I think we need to address #2 above first.

1) Is the loop killable?
2) If not, and we try to catch zero_page scenario, we need to ensure RO
sparse file mapping work. Peter notes that these are probably fine,
filemap.c doesn't seem to use the zero page - but we need to test.

It's agreed, I believe, between myself, Peter, and Shawn that we will
document Private RW mappings as unsupported.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-28 17:40:39

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings

On Tue, Jun 28, 2011 at 07:52:06AM -0700, Darren Hart wrote:
>
>
> On 06/28/2011 03:54 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
> >> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> >> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> >> regression in that it additionally prevented futex operations on a
> >> region within a read only memory mapped file. For example this breaks
> >> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> >> futex within a read only shared mapping, and a writer processes that has
> >> a writable mapping issuing the FUTEX_WAKE.
> >>
> >> This fixes the regression for futex operations that should be valid on
> >> RO mappings by trying a RO get_user_pages_fast() when the RW
> >> get_user_pages_fast() fails so as not to slow down the common path of
> >> writable anonymous maps and bailing when we used the RO path on
> >> anonymous memory.
> >>
> >> While fixing the regression this patch opens up two possible bad
> >> scenarios as identified by KOSAKI Motohiro:
> >>
> >> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
> >> the following corner case.
> >>
> >> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> >> get_futex_key() return inode based key.
> >> sleep on the key
> >> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> >> Thread-B: write memory-region-A.
> >> COW happen. This process's memory-region-A become related
> >> to new COWed private (ie PageAnon=1) page.
> >> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> >> get_futex_key() return mm based key.
> >> IOW, we fail to wake up Thread-A.
> >>
> >> 2) Current futex code doesn't handle zero page properly.
> >>
> >> Read mode get_user_pages() can return zero page, but current futex
> >> code doesn't handle it at all. Then, zero page makes infinite loop
> >> internally.
> >>
> >> This Patch is based on Peter Zijlstra's initial patch with modifications to
> >> only allow RO mappings for futex operations that need VERIFY_READ access.
> >>
> >> Reported-by: David Oliver <[email protected]>
> >> Signed-off-by: Shawn Bohrer <[email protected]>
> >
> > Acked-by: Peter Zijlstra <[email protected]>
>
> I think we need to address #2 above first.

I believe the following contrived case triggers the zero page problem:

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>


int main(int argc, char *argv[])
{
int fd, *futex, rc, val = 42;
struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };

fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
write(fd, &val, 4);
futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
printf("rc=%d errno=%d\n", rc, errno);
}

Running the program above with my patch applied will spin in the
kernel at 100% CPU usage.

> 1) Is the loop killable?

Yes, SIGINT causes the program to return.

> 2) If not, and we try to catch zero_page scenario, we need to ensure RO
> sparse file mapping work. Peter notes that these are probably fine,
> filemap.c doesn't seem to use the zero page - but we need to test.
>
> It's agreed, I believe, between myself, Peter, and Shawn that we will
> document Private RW mappings as unsupported.
>
> --
> Darren Hart
> Intel Open Source Technology Center
> Yocto Project - Linux Kernel


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-28 20:58:57

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings



On 06/28/2011 10:38 AM, Shawn Bohrer wrote:
> On Tue, Jun 28, 2011 at 07:52:06AM -0700, Darren Hart wrote:
>>
>>
>> On 06/28/2011 03:54 AM, Peter Zijlstra wrote:
>>> On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
>>>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>>>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>>>> regression in that it additionally prevented futex operations on a
>>>> region within a read only memory mapped file. For example this breaks
>>>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>>>> futex within a read only shared mapping, and a writer processes that has
>>>> a writable mapping issuing the FUTEX_WAKE.
>>>>
>>>> This fixes the regression for futex operations that should be valid on
>>>> RO mappings by trying a RO get_user_pages_fast() when the RW
>>>> get_user_pages_fast() fails so as not to slow down the common path of
>>>> writable anonymous maps and bailing when we used the RO path on
>>>> anonymous memory.
>>>>
>>>> While fixing the regression this patch opens up two possible bad
>>>> scenarios as identified by KOSAKI Motohiro:
>>>>
>>>> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
>>>> the following corner case.
>>>>
>>>> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
>>>> get_futex_key() return inode based key.
>>>> sleep on the key
>>>> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
>>>> Thread-B: write memory-region-A.
>>>> COW happen. This process's memory-region-A become related
>>>> to new COWed private (ie PageAnon=1) page.
>>>> Thread-B: call futex(FUETX_WAKE, memory-region-A).
>>>> get_futex_key() return mm based key.
>>>> IOW, we fail to wake up Thread-A.
>>>>
>>>> 2) Current futex code doesn't handle zero page properly.
>>>>
>>>> Read mode get_user_pages() can return zero page, but current futex
>>>> code doesn't handle it at all. Then, zero page makes infinite loop
>>>> internally.
>>>>
>>>> This Patch is based on Peter Zijlstra's initial patch with modifications to
>>>> only allow RO mappings for futex operations that need VERIFY_READ access.
>>>>
>>>> Reported-by: David Oliver <[email protected]>
>>>> Signed-off-by: Shawn Bohrer <[email protected]>
>>>
>>> Acked-by: Peter Zijlstra <[email protected]>
>>
>> I think we need to address #2 above first.
>
> I believe the following contrived case triggers the zero page problem:
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> #include <stdio.h>
>
>
> int main(int argc, char *argv[])
> {
> int fd, *futex, rc, val = 42;
> struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };
>
> fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> write(fd, &val, 4);
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
> Running the program above with my patch applied will spin in the
> kernel at 100% CPU usage.
>
>> 1) Is the loop killable?
>
> Yes, SIGINT causes the program to return.
>


Confirmed.


>> 2) If not, and we try to catch zero_page scenario, we need to ensure RO
>> sparse file mapping work. Peter notes that these are probably fine,
>> filemap.c doesn't seem to use the zero page - but we need to test.

I felt it was worth testing sparse files anyway. The following works with a
sparse mapped file created as:

$ dd if=/dev/null of=/tmp/futex_test_sparse bs=1k seek=5120

$ ls -l /tmp/futex_test_sparse
-rw-r--r-- 1 dvhart dvhart 5242880 2011-06-28 13:44 /tmp/futex_test_sparse

$ du -B1 /tmp/futex_test_sparse
4096 /tmp/futex_test_sparse

$ ./futex-zero-page-sparse
fd=3, errno=0
futex=0x7f6b86248000, offset=40960, errno=0
rc=-1 errno=110

Successful ETIMEDOUT return.


#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>


int main(int argc, char *argv[])
{
struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };
int fd, *futex, rc;
off_t offset;

offset = sysconf(_SC_PAGESIZE) * 10;

fd = open("/tmp/futex_test_sparse", O_RDONLY, 0644);
printf("fd=%d, errno=%d\n", fd, errno);
futex = (int *)mmap(NULL, sizeof(int), PROT_READ, MAP_SHARED, fd, offset);
printf("futex=%p, offset=%d, errno=%d\n", futex, offset, errno);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, 0, &ts, 0, 0);
printf("rc=%d errno=%d\n", rc, errno);
}



>> It's agreed, I believe, between myself, Peter, and Shawn that we will
>> document Private RW mappings as unsupported.
>>
>> --
>> Darren Hart
>> Intel Open Source Technology Center
>> Yocto Project - Linux Kernel
>
>
> ---------------------------------------------------------------
> This email, along with any attachments, is confidential. If you
> believe you received this message in error, please contact the
> sender immediately and delete all copies of the message.
> Thank you.
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-28 23:56:13

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings



On 06/28/2011 10:38 AM, Shawn Bohrer wrote:
> On Tue, Jun 28, 2011 at 07:52:06AM -0700, Darren Hart wrote:
>>
>>
>> On 06/28/2011 03:54 AM, Peter Zijlstra wrote:
>>> On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
>>>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>>>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>>>> regression in that it additionally prevented futex operations on a
>>>> region within a read only memory mapped file. For example this breaks
>>>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
>>>> futex within a read only shared mapping, and a writer processes that has
>>>> a writable mapping issuing the FUTEX_WAKE.
>>>>
>>>> This fixes the regression for futex operations that should be valid on
>>>> RO mappings by trying a RO get_user_pages_fast() when the RW
>>>> get_user_pages_fast() fails so as not to slow down the common path of
>>>> writable anonymous maps and bailing when we used the RO path on
>>>> anonymous memory.
>>>>
>>>> While fixing the regression this patch opens up two possible bad
>>>> scenarios as identified by KOSAKI Motohiro:
>>>>
>>>> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
>>>> the following corner case.
>>>>
>>>> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
>>>> get_futex_key() return inode based key.
>>>> sleep on the key
>>>> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
>>>> Thread-B: write memory-region-A.
>>>> COW happen. This process's memory-region-A become related
>>>> to new COWed private (ie PageAnon=1) page.
>>>> Thread-B: call futex(FUETX_WAKE, memory-region-A).
>>>> get_futex_key() return mm based key.
>>>> IOW, we fail to wake up Thread-A.
>>>>
>>>> 2) Current futex code doesn't handle zero page properly.
>>>>
>>>> Read mode get_user_pages() can return zero page, but current futex
>>>> code doesn't handle it at all. Then, zero page makes infinite loop
>>>> internally.
>>>>
>>>> This Patch is based on Peter Zijlstra's initial patch with modifications to
>>>> only allow RO mappings for futex operations that need VERIFY_READ access.
>>>>
>>>> Reported-by: David Oliver <[email protected]>
>>>> Signed-off-by: Shawn Bohrer <[email protected]>
>>>
>>> Acked-by: Peter Zijlstra <[email protected]>
>>
>> I think we need to address #2 above first.
>
> I believe the following contrived case triggers the zero page problem:
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> #include <stdio.h>
>
>
> int main(int argc, char *argv[])
> {
> int fd, *futex, rc, val = 42;
> struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };
>
> fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> write(fd, &val, 4);
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
> Running the program above with my patch applied will spin in the
> kernel at 100% CPU usage.
>
>> 1) Is the loop killable?
>
> Yes, SIGINT causes the program to return.

OK, so while it is killable... I'd really rather not introduce a busy loop in the
kernel. Especially not after we removed it with:

http://lkml.org/lkml/2010/1/13/136

How about adding something like this. I _think_ the only way to get a ZERO_PAGE
here now is with the contrived testcase below.

$ git diff
diff --git a/kernel/futex.c b/kernel/futex.c
index 1737a66..a5417c2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -316,6 +316,13 @@ again:
if (!page_head->mapping) {
unlock_page(page_head);
put_page(page_head);
+ /*
+ * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
+ * trying to find one. RW mapping would have COW'd (and thus
+ * have a mapping) so this page is RO and won't ever change.
+ */
+ if ((page_head == ZERO_PAGE(address)))
+ return -EFAULT;
goto again;
}


This returns EFAULT for the following testcase which spun in get_futex_key with
your V3 patch.


#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>


int main(int argc, char *argv[])
{
int fd, *futex, rc;
struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };

futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
printf("futex @ %p\n", futex);
rc = syscall(SYS_futex, futex, FUTEX_WAIT, 0, &ts, 0, 0);
printf("rc=%d errno=%d\n", rc, errno);
}

Thanks,

Darren

>
>> 2) If not, and we try to catch zero_page scenario, we need to ensure RO
>> sparse file mapping work. Peter notes that these are probably fine,
>> filemap.c doesn't seem to use the zero page - but we need to test.
>>
>> It's agreed, I believe, between myself, Peter, and Shawn that we will
>> document Private RW mappings as unsupported.
>>
>> --
>> Darren Hart
>> Intel Open Source Technology Center
>> Yocto Project - Linux Kernel
>
>
> ---------------------------------------------------------------
> This email, along with any attachments, is confidential. If you
> believe you received this message in error, please contact the
> sender immediately and delete all copies of the message.
> Thank you.
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-29 14:56:59

by Shawn Bohrer

[permalink] [raw]
Subject: Re: [PATCH v3] futex: Fix regression with read only mappings

On Tue, Jun 28, 2011 at 04:55:59PM -0700, Darren Hart wrote:
>
>
> On 06/28/2011 10:38 AM, Shawn Bohrer wrote:
> > On Tue, Jun 28, 2011 at 07:52:06AM -0700, Darren Hart wrote:
> >>
> >>
> >> On 06/28/2011 03:54 AM, Peter Zijlstra wrote:
> >>> On Mon, 2011-06-27 at 17:22 -0500, Shawn Bohrer wrote:
> >>>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> >>>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> >>>> regression in that it additionally prevented futex operations on a
> >>>> region within a read only memory mapped file. For example this breaks
> >>>> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> >>>> futex within a read only shared mapping, and a writer processes that has
> >>>> a writable mapping issuing the FUTEX_WAKE.
> >>>>
> >>>> This fixes the regression for futex operations that should be valid on
> >>>> RO mappings by trying a RO get_user_pages_fast() when the RW
> >>>> get_user_pages_fast() fails so as not to slow down the common path of
> >>>> writable anonymous maps and bailing when we used the RO path on
> >>>> anonymous memory.
> >>>>
> >>>> While fixing the regression this patch opens up two possible bad
> >>>> scenarios as identified by KOSAKI Motohiro:
> >>>>
> >>>> 1) This patch also allows FUTEX_WAIT on RO private mappings which have
> >>>> the following corner case.
> >>>>
> >>>> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> >>>> get_futex_key() return inode based key.
> >>>> sleep on the key
> >>>> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> >>>> Thread-B: write memory-region-A.
> >>>> COW happen. This process's memory-region-A become related
> >>>> to new COWed private (ie PageAnon=1) page.
> >>>> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> >>>> get_futex_key() return mm based key.
> >>>> IOW, we fail to wake up Thread-A.
> >>>>
> >>>> 2) Current futex code doesn't handle zero page properly.
> >>>>
> >>>> Read mode get_user_pages() can return zero page, but current futex
> >>>> code doesn't handle it at all. Then, zero page makes infinite loop
> >>>> internally.
> >>>>
> >>>> This Patch is based on Peter Zijlstra's initial patch with modifications to
> >>>> only allow RO mappings for futex operations that need VERIFY_READ access.
> >>>>
> >>>> Reported-by: David Oliver <[email protected]>
> >>>> Signed-off-by: Shawn Bohrer <[email protected]>
> >>>
> >>> Acked-by: Peter Zijlstra <[email protected]>
> >>
> >> I think we need to address #2 above first.
> >
> > I believe the following contrived case triggers the zero page problem:
> >
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdint.h>
> > #include <linux/futex.h>
> > #include <sys/mman.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> > #include <stdio.h>
> >
> >
> > int main(int argc, char *argv[])
> > {
> > int fd, *futex, rc, val = 42;
> > struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };
> >
> > fd = open("/tmp/futex_test", O_RDWR|O_CREAT, 0644);
> > write(fd, &val, 4);
> > futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > rc = syscall(SYS_futex, futex, FUTEX_WAIT, val, &ts, 0, 0);
> > printf("rc=%d errno=%d\n", rc, errno);
> > }
> >
> > Running the program above with my patch applied will spin in the
> > kernel at 100% CPU usage.
> >
> >> 1) Is the loop killable?
> >
> > Yes, SIGINT causes the program to return.
>
> OK, so while it is killable... I'd really rather not introduce a busy loop in the
> kernel. Especially not after we removed it with:
>
> http://lkml.org/lkml/2010/1/13/136
>
> How about adding something like this. I _think_ the only way to get a ZERO_PAGE
> here now is with the contrived testcase below.
>
> $ git diff
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 1737a66..a5417c2 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -316,6 +316,13 @@ again:
> if (!page_head->mapping) {
> unlock_page(page_head);
> put_page(page_head);
> + /*
> + * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
> + * trying to find one. RW mapping would have COW'd (and thus
> + * have a mapping) so this page is RO and won't ever change.
> + */
> + if ((page_head == ZERO_PAGE(address)))
> + return -EFAULT;
> goto again;
> }
>
>
> This returns EFAULT for the following testcase which spun in get_futex_key with
> your V3 patch.

I've tested and confirmed this as well. I can send a v4 patch with
this included.

>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <linux/futex.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> #include <stdio.h>
>
>
> int main(int argc, char *argv[])
> {
> int fd, *futex, rc;
> struct timespec ts = {.tv_sec = 2, .tv_nsec = 0 };
>
> futex = (int *)mmap(0, sizeof(int), PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> printf("futex @ %p\n", futex);
> rc = syscall(SYS_futex, futex, FUTEX_WAIT, 0, &ts, 0, 0);
> printf("rc=%d errno=%d\n", rc, errno);
> }
>
> Thanks,
>
> Darren
>
> >
> >> 2) If not, and we try to catch zero_page scenario, we need to ensure RO
> >> sparse file mapping work. Peter notes that these are probably fine,
> >> filemap.c doesn't seem to use the zero page - but we need to test.
> >>
> >> It's agreed, I believe, between myself, Peter, and Shawn that we will
> >> document Private RW mappings as unsupported.

--
Shawn


---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-29 15:18:07

by Shawn Bohrer

[permalink] [raw]
Subject: [PATCH v4] futex: Fix regression with read only mappings

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 introduced a user-mode
regression in that it additionally prevented futex operations on a
region within a read only memory mapped file. For example this breaks
workloads that have one or more reader processes doing a FUTEX_WAIT on a
futex within a read only shared mapping, and a writer processes that has
a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for futex operations that should be valid on
RO mappings by trying a RO get_user_pages_fast() when the RW
get_user_pages_fast() fails so as not to slow down the common path of
writable anonymous maps and bailing when we used the RO path on
anonymous memory.

While fixing the regression this patch opens up a possible bad
scenarios as identified by KOSAKI Motohiro:

This patch also allows FUTEX_WAIT on RO private mappings which have
the following corner case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.

This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.

Reported-by: David Oliver <[email protected]>
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
---

Included Darren Hart's fix to infinite loop in kernel on zero page.

kernel/futex.c | 57 ++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..d919cc0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_RO 0x08

/*
* Priority Inheritance state:
@@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
/**
* get_futex_key() - Get parameters which are the keys for a futex
* @uaddr: virtual address of the futex
- * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
+ * @flags: futex flags: FLAGS_SHARED and FLAGS_RO are valid.
* @key: address where result is stored.
*
* Returns a negative error code or 0
@@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
+get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
* Note : We do have to check 'uaddr' is a valid user address,
* but access_ok() should be faster than find_vma()
*/
- if (!fshared) {
+ if (!(flags & FLAGS_SHARED)) {
if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
return -EFAULT;
key->private.mm = mm;
@@ -262,8 +263,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ /*
+ * If write access is not required (eg. FUTEX_WAIT), try
+ * and get read-only access.
+ */
+ if (err == -EFAULT && flags & FLAGS_RO) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;
+ else
+ err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
page_head = page;
@@ -305,6 +316,13 @@ again:
if (!page_head->mapping) {
unlock_page(page_head);
put_page(page_head);
+ /*
+ * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
+ * trying to find one. RW mapping would have COW'd (and thus
+ * have a mapping) so this page is RO and won't ever change.
+ */
+ if ((page_head == ZERO_PAGE(address)))
+ return -EFAULT;
goto again;
}

@@ -316,6 +334,15 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ /*
+ * A RO anonymous page will never change and thus doesn't make
+ * sense for futex operations.
+ */
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +354,10 @@ again:

get_futex_key_refs(key);

+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)
@@ -940,7 +968,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags | FLAGS_RO, &key);
if (unlikely(ret != 0))
goto out;

@@ -986,10 +1014,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags, &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1243,10 +1271,11 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, requeue_pi ? flags : flags | FLAGS_RO,
+ &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1790,7 +1819,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
+ ret = get_futex_key(uaddr, flags | FLAGS_RO, &q->key);
if (unlikely(ret != 0))
return ret;

@@ -1941,7 +1970,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ ret = get_futex_key(uaddr, flags, &q.key);
if (unlikely(ret != 0))
goto out;

@@ -2060,7 +2089,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags, &key);
if (unlikely(ret != 0))
goto out;

@@ -2249,7 +2278,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags, &key2);
if (unlikely(ret != 0))
goto out;

--
1.6.5.2



---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-29 18:41:33

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v4] futex: Fix regression with read only mappings



On 06/29/2011 08:17 AM, Shawn Bohrer wrote:
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
> regression in that it additionally prevented futex operations on a
> region within a read only memory mapped file. For example this breaks
> workloads that have one or more reader processes doing a FUTEX_WAIT on a
> futex within a read only shared mapping, and a writer processes that has
> a writable mapping issuing the FUTEX_WAKE.
>
> This fixes the regression for futex operations that should be valid on
> RO mappings by trying a RO get_user_pages_fast() when the RW
> get_user_pages_fast() fails so as not to slow down the common path of
> writable anonymous maps and bailing when we used the RO path on
> anonymous memory.
>
> While fixing the regression this patch opens up a possible bad
> scenarios as identified by KOSAKI Motohiro:
>
> This patch also allows FUTEX_WAIT on RO private mappings which have
> the following corner case.
>
> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> get_futex_key() return inode based key.
> sleep on the key
> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> Thread-B: write memory-region-A.
> COW happen. This process's memory-region-A become related
> to new COWed private (ie PageAnon=1) page.
> Thread-B: call futex(FUETX_WAKE, memory-region-A).
> get_futex_key() return mm based key.
> IOW, we fail to wake up Thread-A.
>
> This Patch is based on Peter Zijlstra's initial patch with modifications to
> only allow RO mappings for futex operations that need VERIFY_READ access.
>
> Reported-by: David Oliver <[email protected]>
> Signed-off-by: Shawn Bohrer <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> ---
>
> Included Darren Hart's fix to infinite loop in kernel on zero page.


Thanks for sticking with it Shawn!

Signed-off-by: Darren Hart <[email protected]>


> kernel/futex.c | 57 ++++++++++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 43 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..d919cc0 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
> #define FLAGS_SHARED 0x01
> #define FLAGS_CLOCKRT 0x02
> #define FLAGS_HAS_TIMEOUT 0x04
> +#define FLAGS_RO 0x08
>
> /*
> * Priority Inheritance state:
> @@ -216,7 +217,7 @@ static void drop_futex_key_refs(union futex_key *key)
> /**
> * get_futex_key() - Get parameters which are the keys for a futex
> * @uaddr: virtual address of the futex
> - * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
> + * @flags: futex flags: FLAGS_SHARED and FLAGS_RO are valid.
> * @key: address where result is stored.
> *
> * Returns a negative error code or 0
> @@ -229,12 +230,12 @@ static void drop_futex_key_refs(union futex_key *key)
> * lock_page() might sleep, the caller should not hold a spinlock.
> */
> static int
> -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> +get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key)
> {
> unsigned long address = (unsigned long)uaddr;
> struct mm_struct *mm = current->mm;
> struct page *page, *page_head;
> - int err;
> + int err, ro = 0;
>
> /*
> * The futex address must be "naturally" aligned.
> @@ -251,7 +252,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
> * Note : We do have to check 'uaddr' is a valid user address,
> * but access_ok() should be faster than find_vma()
> */
> - if (!fshared) {
> + if (!(flags & FLAGS_SHARED)) {
> if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
> return -EFAULT;
> key->private.mm = mm;
> @@ -262,8 +263,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
>
> again:
> err = get_user_pages_fast(address, 1, 1, &page);
> + /*
> + * If write access is not required (eg. FUTEX_WAIT), try
> + * and get read-only access.
> + */
> + if (err == -EFAULT && flags & FLAGS_RO) {
> + err = get_user_pages_fast(address, 1, 0, &page);
> + ro = 1;
> + }
> if (err < 0)
> return err;
> + else
> + err = 0;
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> page_head = page;
> @@ -305,6 +316,13 @@ again:
> if (!page_head->mapping) {
> unlock_page(page_head);
> put_page(page_head);
> + /*
> + * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
> + * trying to find one. RW mapping would have COW'd (and thus
> + * have a mapping) so this page is RO and won't ever change.
> + */
> + if ((page_head == ZERO_PAGE(address)))
> + return -EFAULT;
> goto again;
> }
>
> @@ -316,6 +334,15 @@ again:
> * the object not the particular process.
> */
> if (PageAnon(page_head)) {
> + /*
> + * A RO anonymous page will never change and thus doesn't make
> + * sense for futex operations.
> + */
> + if (ro) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> key->private.mm = mm;
> key->private.address = address;
> @@ -327,9 +354,10 @@ again:
>
> get_futex_key_refs(key);
>
> +out:
> unlock_page(page_head);
> put_page(page_head);
> - return 0;
> + return err;
> }
>
> static inline void put_futex_key(union futex_key *key)
> @@ -940,7 +968,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> if (!bitset)
> return -EINVAL;
>
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> + ret = get_futex_key(uaddr, flags | FLAGS_RO, &key);

> if (unlikely(ret != 0))
> goto out;
>
> @@ -986,10 +1014,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> int ret, op_ret;
>
> retry:
> - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> + ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
> if (unlikely(ret != 0))
> goto out;
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, flags, &key2);
> if (unlikely(ret != 0))
> goto out_put_key1;
>
> @@ -1243,10 +1271,11 @@ retry:
> pi_state = NULL;
> }
>
> - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
> + ret = get_futex_key(uaddr1, flags | FLAGS_RO, &key1);
> if (unlikely(ret != 0))
> goto out;
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, requeue_pi ? flags : flags | FLAGS_RO,
> + &key2);
> if (unlikely(ret != 0))
> goto out_put_key1;
>
> @@ -1790,7 +1819,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> * while the syscall executes.
> */
> retry:
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
> + ret = get_futex_key(uaddr, flags | FLAGS_RO, &q->key);
> if (unlikely(ret != 0))
> return ret;
>
> @@ -1941,7 +1970,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
> }
>
> retry:
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
> + ret = get_futex_key(uaddr, flags, &q.key);
> if (unlikely(ret != 0))
> goto out;
>
> @@ -2060,7 +2089,7 @@ retry:
> if ((uval & FUTEX_TID_MASK) != vpid)
> return -EPERM;
>
> - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
> + ret = get_futex_key(uaddr, flags, &key);
> if (unlikely(ret != 0))
> goto out;
>
> @@ -2249,7 +2278,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
> debug_rt_mutex_init_waiter(&rt_waiter);
> rt_waiter.task = NULL;
>
> - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
> + ret = get_futex_key(uaddr2, flags, &key2);
> if (unlikely(ret != 0))
> goto out;
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-29 23:38:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v4] futex: Fix regression with read only mappings

On Wed, 29 Jun 2011, Shawn Bohrer wrote:
>
> While fixing the regression this patch opens up a possible bad
> scenarios as identified by KOSAKI Motohiro:
>
> This patch also allows FUTEX_WAIT on RO private mappings which have
> the following corner case.

These two sentences make no sense at all. We really need a very
accurate description of this change. That code is subtle and we really
want to have a very clear and understandable changelog.

Your changelog fails the basic test by mentioning "corner case" simply
because the whole futex code consists only of corner cases.

Thanks,

tglx

2011-06-30 04:19:22

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v4] futex: Fix regression with read only mappings



On 06/29/2011 04:38 PM, Thomas Gleixner wrote:
> On Wed, 29 Jun 2011, Shawn Bohrer wrote:
>>
>> While fixing the regression this patch opens up a possible bad
>> scenarios as identified by KOSAKI Motohiro:
>>
>> This patch also allows FUTEX_WAIT on RO private mappings which have
>> the following corner case.
>
> These two sentences make no sense at all. We really need a very
> accurate description of this change. That code is subtle and we really
> want to have a very clear and understandable changelog.
>
> Your changelog fails the basic test by mentioning "corner case" simply
> because the whole futex code consists only of corner cases.
>
> Thanks,
>
> tglx

Yeah, those messages are quotes from Kosaki, but that isn't apparent without
having all the context. They are confusing. The language needs to be cleaned up
a bit as well.

Shawn, I apologize for this as it was my idea to begin with, but after rereading
all of the previous patches from Kosaki, I realized that the rw parameter was
part of the original design (and not newly introduced by your patch) and that
cramming the FLAGS_RO flag into the flags variables muddies the meaning of
flags. flags is meant to modify a particular futex call and ensure that call is
correctly restarted after a signal. The RO/RW bit pertains to the calling
function and does not vary from call to call. We should revert back to the rw
parameter using VERIFY_READ and VERIFY_WRITE.

How about this for a header (I'll leave it to Shawn to
incorporate, adjust, and resend):

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw parameter
from get_futex_key()) in 2.6.33 introduced a user-mode regression in that it
broke futex operations on read-only memory maps in addition to preventing a loop
when encountering a ZERO_PAGE. For example, this breaks workloads that have one
or more reader processes doing a FUTEX_WAIT on a futex within a read only shared
file mapping, and a writer processes that has a writable mapping issuing the
FUTEX_WAKE.

This fixes the regression for valid futex operations on RO mappings by trying a
RO get_user_pages_fast() when the RW get_user_pages_fast() fails. This change
makes it necessary to also check for invalid use cases, such as anonymous RO
mappings (which can never change) and the ZERO_PAGE which the commit referenced
above was written to address.

This patch does restore the original behavior with RO private mappings which
suffer from the following corner case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() returns an inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
This process's memory-region-A gets remapped to a
COWed PageAnon=1 page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() returns an mm based key.
Thread-A is never woken as it is waiting on a different key.

Checking for a private mapping requires walking the vmas and was deemed too
costly to avoid a userspace hang in a nonsensical use case.

This Patch is based on Peter Zijlstra's initial patch with modifications to only
allow RO mappings for futex operations that need VERIFY_READ access.


--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-30 14:03:10

by David C. Oliver

[permalink] [raw]
Subject: Re: [PATCH v4] futex: Fix regression with read only mappings


On Jun 29, 2011, at 11:19 PM, Darren Hart wrote:

>
>
> On 06/29/2011 04:38 PM, Thomas Gleixner wrote:
>> On Wed, 29 Jun 2011, Shawn Bohrer wrote:
>>>
>>> While fixing the regression this patch opens up a possible bad
>>> scenarios as identified by KOSAKI Motohiro:
>>>
>>> This patch also allows FUTEX_WAIT on RO private mappings which have
>>> the following corner case.
>>
>> These two sentences make no sense at all. We really need a very
>> accurate description of this change. That code is subtle and we really
>> want to have a very clear and understandable changelog.
>>
>> Your changelog fails the basic test by mentioning "corner case" simply
>> because the whole futex code consists only of corner cases.
>>
>> Thanks,
>>
>> tglx
>
> Yeah, those messages are quotes from Kosaki, but that isn't apparent without
> having all the context. They are confusing. The language needs to be cleaned up
> a bit as well.
>
> Shawn, I apologize for this as it was my idea to begin with, but after rereading
> all of the previous patches from Kosaki, I realized that the rw parameter was
> part of the original design (and not newly introduced by your patch) and that
> cramming the FLAGS_RO flag into the flags variables muddies the meaning of
> flags. flags is meant to modify a particular futex call and ensure that call is
> correctly restarted after a signal. The RO/RW bit pertains to the calling
> function and does not vary from call to call. We should revert back to the rw
> parameter using VERIFY_READ and VERIFY_WRITE.
>
> How about this for a header (I'll leave it to Shawn to
> incorporate, adjust, and resend):
>
> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw parameter
> from get_futex_key()) in 2.6.33 introduced a user-mode regression in that it
> broke futex operations on read-only memory maps in addition to preventing a loop
> when encountering a ZERO_PAGE. For example, this breaks workloads that have one
> or more reader processes doing a FUTEX_WAIT on a futex within a read only shared
> file mapping, and a writer processes that has a writable mapping issuing the
> FUTEX_WAKE.
>
> This fixes the regression for valid futex operations on RO mappings by trying a
> RO get_user_pages_fast() when the RW get_user_pages_fast() fails. This change
> makes it necessary to also check for invalid use cases, such as anonymous RO
> mappings (which can never change) and the ZERO_PAGE which the commit referenced
> above was written to address.
>
> This patch does restore the original behavior with RO private mappings which
> suffer from the following corner case.
>
> Thread-A: call futex(FUTEX_WAIT, memory-region-A).
> get_futex_key() returns an inode based key.
> sleep on the key
> Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
> Thread-B: write memory-region-A.
> This process's memory-region-A gets remapped to a
> COWed PageAnon=1 page.
> Thread-B: call futex(FUETX_WAKE, memory-region-A).
s/FUETEX_WAKE/FUTEX_WAIT/
> get_futex_key() returns an mm based key.
> Thread-A is never woken as it is waiting on a different key.

If I follow this correctly, it implies that a FUTEX_WAIT on an RO private mapped futex will never return normally, as changing the futex's value will cause the COW behavior (leaving the value unchanged before doing the FUTEX_WAIT will not wake the waiter).

> Checking for a private mapping requires walking the vmas and was deemed too
> costly to avoid a userspace hang in a nonsensical use case.
>
> This Patch is based on Peter Zijlstra's initial patch with modifications to only
> allow RO mappings for futex operations that need VERIFY_READ access.
>
>
> --
> Darren Hart
> Intel Open Source Technology Center
> Yocto Project - Linux Kernel

Cheers!

David.

---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.

2011-06-30 15:41:12

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH v4] futex: Fix regression with read only mappings



On 06/30/2011 07:02 AM, David C. Oliver wrote:
>
> On Jun 29, 2011, at 11:19 PM, Darren Hart wrote:
>
>>
>>
>> On 06/29/2011 04:38 PM, Thomas Gleixner wrote:
>>> On Wed, 29 Jun 2011, Shawn Bohrer wrote:
>>>>
>>>> While fixing the regression this patch opens up a possible bad
>>>> scenarios as identified by KOSAKI Motohiro:
>>>>
>>>> This patch also allows FUTEX_WAIT on RO private mappings which
>>>> have the following corner case.
>>>
>>> These two sentences make no sense at all. We really need a very
>>> accurate description of this change. That code is subtle and we
>>> really want to have a very clear and understandable changelog.
>>>
>>> Your changelog fails the basic test by mentioning "corner case"
>>> simply because the whole futex code consists only of corner
>>> cases.
>>>
>>> Thanks,
>>>
>>> tglx
>>
>> Yeah, those messages are quotes from Kosaki, but that isn't
>> apparent without having all the context. They are confusing. The
>> language needs to be cleaned up a bit as well.
>>
>> Shawn, I apologize for this as it was my idea to begin with, but
>> after rereading all of the previous patches from Kosaki, I realized
>> that the rw parameter was part of the original design (and not
>> newly introduced by your patch) and that cramming the FLAGS_RO flag
>> into the flags variables muddies the meaning of flags. flags is
>> meant to modify a particular futex call and ensure that call is
>> correctly restarted after a signal. The RO/RW bit pertains to the
>> calling function and does not vary from call to call. We should
>> revert back to the rw parameter using VERIFY_READ and
>> VERIFY_WRITE.
>>
>> How about this for a header (I'll leave it to Shawn to incorporate,
>> adjust, and resend):
>>
>> commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
>> parameter from get_futex_key()) in 2.6.33 introduced a user-mode
>> regression in that it broke futex operations on read-only memory
>> maps in addition to preventing a loop when encountering a
>> ZERO_PAGE. For example, this breaks workloads that have one or
>> more reader processes doing a FUTEX_WAIT on a futex within a read
>> only shared file mapping, and a writer processes that has a
>> writable mapping issuing the FUTEX_WAKE.
>>
>> This fixes the regression for valid futex operations on RO mappings
>> by trying a RO get_user_pages_fast() when the RW
>> get_user_pages_fast() fails. This change makes it necessary to also
>> check for invalid use cases, such as anonymous RO mappings (which
>> can never change) and the ZERO_PAGE which the commit referenced
>> above was written to address.
>>
>> This patch does restore the original behavior with RO private
>> mappings which suffer from the following corner case.
>>
>> Thread-A: call futex(FUTEX_WAIT, memory-region-A). get_futex_key()
>> returns an inode based key. sleep on the key Thread-B: call
>> mprotect(PROT_READ|PROT_WRITE, memory-region-A) Thread-B: write
>> memory-region-A. This process's memory-region-A gets remapped to a
>> COWed PageAnon=1 page. Thread-B: call futex(FUETX_WAKE,
>> memory-region-A).
> s/FUETEX_WAKE/FUTEX_WAIT/
>> get_futex_key() returns an mm based key. Thread-A is never woken as
>> it is waiting on a different key.
>
> If I follow this correctly, it implies that a FUTEX_WAIT on an RO
> private mapped futex will never return normally, as changing the
> futex's value will cause the COW behavior (leaving the value
> unchanged before doing the FUTEX_WAIT will not wake the waiter).


That is correct. Hopefully that isn't just implied. ;-) However, because
it is private, before the futex value can change, you also have to
change the RW policy. This "use case" makes no sense, and it was
determined that processes doing this deserve what they get.

--
Darren

>> Checking for a private mapping requires walking the vmas and was
>> deemed too costly to avoid a userspace hang in a nonsensical use
>> case.
>>
>> This Patch is based on Peter Zijlstra's initial patch with
>> modifications to only allow RO mappings for futex operations that
>> need VERIFY_READ access.
>>
>>
>> -- Darren Hart Intel Open Source Technology Center Yocto Project -
>> Linux Kernel
>
> Cheers!
>
> David.
>
> --------------------------------------------------------------- This
> email, along with any attachments, is confidential. If you believe
> you received this message in error, please contact the sender
> immediately and delete all copies of the message. Thank you.
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

2011-06-30 16:21:50

by Shawn Bohrer

[permalink] [raw]
Subject: [PATCH v5] futex: Fix regression with read only mappings

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It
prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW
MAP_PRIVATE futex operations by forcing the COW to occur by
unconditionally performing a write access get_user_pages_fast() to get
the page. The commit also introduced a user-mode regression in that it
broke futex operations on read-only memory maps. For example, this
breaks workloads that have one or more reader processes doing a
FUTEX_WAIT on a futex within a read only shared file mapping, and a
writer processes that has a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for valid futex operations on RO mappings by
trying a RO get_user_pages_fast() when the RW get_user_pages_fast()
fails. This change makes it necessary to also check for invalid use
cases, such as anonymous RO mappings (which can never change) and the
ZERO_PAGE which the commit referenced above was written to address.

This patch does restore the original behavior with RO MAP_PRIVATE
mappings, which have inherent user-mode usage problems and don't really
make sense. With this patch performing a FUTEX_WAIT within a RO
MAP_PRIVATE mapping will be successfully woken provided another process
updates the region of the underlying mapped file. However, the mmap()
man page states that for a MAP_PRIVATE mapping:

It is unspecified whether changes made to the file after
the mmap() call are visible in the mapped region.

So user-mode users attempting to use futex operations on RO MAP_PRIVATE
mappings are depending on unspecified behavior. Additionally a
RO MAP_PRIVATE mapping could fail to wake up in the following case.

Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.

Once again doing something like this is just silly and users who do
something like this get what they deserve.

While RO MAP_PRIVATE mappings are nonsensical, checking for a private
mapping requires walking the vmas and was deemed too costly to avoid a
userspace hang.

This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.

Reported-by: David Oliver <[email protected]>
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Darren Hart <[email protected]>
---

Clairified commit description. Changed get_futex_key() back to having a
fshared and rw parameter, per Darren's request.

kernel/futex.c | 54 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..70bb54b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,6 +218,8 @@ static void drop_futex_key_refs(union futex_key *key)
* @uaddr: virtual address of the futex
* @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
* @key: address where result is stored.
+ * @rw: mapping needs to be read/write (values: VERIFY_READ,
+ * VERIFY_WRITE)
*
* Returns a negative error code or 0
* The key words are stored in *key on success.
@@ -229,12 +231,12 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
+get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
- int err;
+ int err, ro = 0;

/*
* The futex address must be "naturally" aligned.
@@ -262,8 +264,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ /*
+ * If write access is not required (eg. FUTEX_WAIT), try
+ * and get read-only access.
+ */
+ if (err == -EFAULT && rw == VERIFY_READ) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;
+ else
+ err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
page_head = page;
@@ -305,6 +317,13 @@ again:
if (!page_head->mapping) {
unlock_page(page_head);
put_page(page_head);
+ /*
+ * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
+ * trying to find one. RW mapping would have COW'd (and thus
+ * have a mapping) so this page is RO and won't ever change.
+ */
+ if ((page_head == ZERO_PAGE(address)))
+ return -EFAULT;
goto again;
}

@@ -316,6 +335,15 @@ again:
* the object not the particular process.
*/
if (PageAnon(page_head)) {
+ /*
+ * A RO anonymous page will never change and thus doesn't make
+ * sense for futex operations.
+ */
+ if (ro) {
+ err = -EFAULT;
+ goto out;
+ }
+
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
@@ -327,9 +355,10 @@ again:

get_futex_key_refs(key);

+out:
unlock_page(page_head);
put_page(page_head);
- return 0;
+ return err;
}

static inline void put_futex_key(union futex_key *key)
@@ -940,7 +969,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
if (unlikely(ret != 0))
goto out;

@@ -986,10 +1015,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1243,10 +1272,11 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
+ requeue_pi ? VERIFY_WRITE : VERIFY_READ);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1790,7 +1820,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ);
if (unlikely(ret != 0))
return ret;

@@ -1941,7 +1971,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2060,7 +2090,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2249,7 +2279,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

--
1.6.5.2



---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.