2009-03-27 02:12:33

by Minchan Kim

[permalink] [raw]
Subject: Question about PRIVATE_FUTEX

Hi, Peter and Eric.

I am not expert about futex.
I am sorry if this is dumb question.

If we use private futex, get_futex_key don't call get_user_pages_fast
which pins page at page table.
Then, get_futex_value_locked calls __cpy_from_user_inatomic with
pagefault_disable.

Who make sure the user page is mapped at app's page table ?

--
Kinds regards,
Minchan Kim


2009-03-27 04:32:19

by Minchan Kim

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Hmm, It seems even shared futex, too.

After calling get_user_pages_fast, get_futex_key calls unlock, put_page, too.
Then futex_wait calls get_futex_value_locked.
Wouldn't kernel reclaim the page between get_fuex_key and
get_futex_value_locked ?

How do we make sure this race condition ?


On Fri, Mar 27, 2009 at 11:12 AM, Minchan Kim <[email protected]> wrote:
> Hi, Peter and Eric.
>
> I am not expert about futex.
> I am sorry if this is dumb question.
>
> If we use private futex, get_futex_key don't call get_user_pages_fast
> which pins page at page table.
> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
> pagefault_disable.
>
> Who make sure the user page is mapped at app's page table ?
>
> --
> Kinds regards,
> Minchan Kim
>



--
Kinds regards,
Minchan Kim

2009-03-27 05:01:34

by Eric Dumazet

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Minchan Kim a écrit :
> Hi, Peter and Eric.
>
> I am not expert about futex.
> I am sorry if this is dumb question.
>
> If we use private futex, get_futex_key don't call get_user_pages_fast
> which pins page at page table.
> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
> pagefault_disable.
>
> Who make sure the user page is mapped at app's page table ?
>

Nothing makes sure user page is mapped, as we dont have to (for private futexes
at least, since the 'key' is a combination of the futex virtual address (not
depending on the underlying physical page) and the task mm (sort of a static
offset per task)

If no page is mapped, a normal error should be returned to user, since
access to futex location will trigger a fault.

2009-03-27 05:20:28

by Minchan Kim

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Thanks for kind explanation.

On Fri, Mar 27, 2009 at 1:56 PM, Eric Dumazet <[email protected]> wrote:
> Minchan Kim a écrit :
>> Hi, Peter and Eric.
>>
>> I am not expert about futex.
>> I am sorry if this is dumb question.
>>
>> If we use private futex, get_futex_key don't call get_user_pages_fast
>> which pins page at page table.
>> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
>> pagefault_disable.
>>
>> Who make sure the user page is mapped at app's page table ?
>>
>
> Nothing makes sure user page is mapped, as we dont have to (for private futexes
> at least, since the 'key' is a combination of the futex virtual address (not
> depending on the underlying physical page) and the task mm (sort of a static
> offset per task)
> If no page is mapped, a normal error should be returned to user, since
> access to futex location will trigger a fault.
>

I mean as follows.
It seems even shared futex case.

After calling get_user_pages_fast, get_futex_key calls unlock_page and
put_page, too. Then futex_wait calls get_futex_value_locked.

Generally, current page->count is one and nolocked.
I think kernel reclaimer can reclaim the page.

Wouldn't kernel reclaim the page between get_fuex_key and
get_futex_value_locked ?
If kernel reclaimed the page, __copy_from_user_inatomic can happens
page fault although pagefault_disable is on.

How do we make sure this race condition ?
Do I miss something ?

--
Kinds regards,
Minchan Kim

2009-03-27 05:51:27

by Eric Dumazet

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Minchan Kim a écrit :
> Thanks for kind explanation.
>
> On Fri, Mar 27, 2009 at 1:56 PM, Eric Dumazet <[email protected]> wrote:
>> Minchan Kim a écrit :
>>> Hi, Peter and Eric.
>>>
>>> I am not expert about futex.
>>> I am sorry if this is dumb question.
>>>
>>> If we use private futex, get_futex_key don't call get_user_pages_fast
>>> which pins page at page table.
>>> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
>>> pagefault_disable.
>>>
>>> Who make sure the user page is mapped at app's page table ?
>>>
>> Nothing makes sure user page is mapped, as we dont have to (for private futexes
>> at least, since the 'key' is a combination of the futex virtual address (not
>> depending on the underlying physical page) and the task mm (sort of a static
>> offset per task)
>> If no page is mapped, a normal error should be returned to user, since
>> access to futex location will trigger a fault.
>>
>
> I mean as follows.
> It seems even shared futex case.
>
> After calling get_user_pages_fast, get_futex_key calls unlock_page and
> put_page, too. Then futex_wait calls get_futex_value_locked.
>
> Generally, current page->count is one and nolocked.
> I think kernel reclaimer can reclaim the page.
>
> Wouldn't kernel reclaim the page between get_fuex_key and
> get_futex_value_locked ?
> If kernel reclaimed the page, __copy_from_user_inatomic can happens
> page fault although pagefault_disable is on.
>
> How do we make sure this race condition ?
> Do I miss something ?
>

Hmmm, so your question is not about PRIVATE futexes, but shared ones.

I guess if page is no more present, its not a problem since
get_futex_value_locked() returns an error. We then take a slow
path, calling get_user() and retrying whole futex logic.

However, comment at line 1213 is misleading I guess, since
we dont hold mmap semaphore anymore ?

* for shared futexes, we hold the mmap semaphore, so the mapping
* cannot have changed since we looked it up in get_futex_key.
*/
ret = get_futex_value_locked(&uval, uaddr);

So if page was un-mapped by another thread, and re-mapped to another physical
page, then this thread might sleep on 'kernel futex' not anymore reachable...

User error, as it is not supposed to happen in a sane program, undefined
result...


2009-03-27 06:20:47

by Minchan Kim

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

On Fri, Mar 27, 2009 at 2:50 PM, Eric Dumazet <[email protected]> wrote:
> Minchan Kim a écrit :
>> Thanks for kind explanation.
>>
>> On Fri, Mar 27, 2009 at 1:56 PM, Eric Dumazet <[email protected]> wrote:
>>> Minchan Kim a écrit :
>>>> Hi, Peter and Eric.
>>>>
>>>> I am not expert about futex.
>>>> I am sorry if this is dumb question.
>>>>
>>>> If we use private futex, get_futex_key don't call get_user_pages_fast
>>>> which pins page at page table.
>>>> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
>>>> pagefault_disable.
>>>>
>>>> Who make sure the user page is mapped at app's page table ?
>>>>
>>> Nothing makes sure user page is mapped, as we dont have to (for private futexes
>>> at least, since the 'key' is a combination of the futex virtual address (not
>>> depending on the underlying physical page) and the task mm (sort of a static
>>> offset per task)
>>> If no page is mapped, a normal error should be returned to user, since
>>> access to futex location will trigger a fault.
>>>
>>
>> I mean as follows.
>> It seems even shared futex case.
>>
>> After calling get_user_pages_fast, get_futex_key calls unlock_page and
>> put_page, too.  Then futex_wait  calls get_futex_value_locked.
>>
>> Generally, current page->count is one and nolocked.
>> I think kernel reclaimer can reclaim the page.
>>
>> Wouldn't kernel reclaim the page between get_fuex_key and
>> get_futex_value_locked ?
>> If kernel reclaimed the page, __copy_from_user_inatomic can happens
>> page fault although pagefault_disable is on.
>>
>> How do we make sure this race condition ?
>> Do I miss something ?
>>
>
> Hmmm, so your question is not about PRIVATE futexes, but shared ones.
>
> I guess if page is no more present, its not a problem since
> get_futex_value_locked() returns an error. We then take a slow
> path, calling get_user() and retrying whole futex logic.

Indeed.
I misunderstood about __copy_from_user_inatomic.
It never sleep.

> However, comment at line 1213 is misleading I guess, since
> we dont hold mmap semaphore anymore ?
>
>         * for shared futexes, we hold the mmap semaphore, so the mapping
>         * cannot have changed since we looked it up in get_futex_key.
>         */
>        ret = get_futex_value_locked(&uval, uaddr);
>
> So if page was un-mapped by another thread, and re-mapped to another physical
> page, then this thread might sleep on 'kernel futex' not anymore reachable...
>
> User error, as it is not supposed to happen in a sane program, undefined
> result...

Yes. How about removing confusing comments ?

Thanks for great explanation :)

--
Kinds regards,
Minchan Kim

2009-03-27 08:49:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

On Fri, 2009-03-27 at 11:12 +0900, Minchan Kim wrote:
> Hi, Peter and Eric.
>
> I am not expert about futex.
> I am sorry if this is dumb question.
>
> If we use private futex, get_futex_key don't call get_user_pages_fast
> which pins page at page table.

But also drops that page ref at the end of get_futex_key(). The whole
and only purpose of using get_user_pages_fast() is to get at the mapping
data without having to obtain the mmap_sem.

> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
> pagefault_disable.
>
> Who make sure the user page is mapped at app's page table ?

Nobody, all uses of get_futex_value_locked() have to deal with it
returning -EFAULT.

Most of this is legacy btw, from when futex ops were done under the
mmap_sem. Back then we couldn't fault because that would cause mmap_sem
recursion. Howver, now that we don't hold mmap_sem anymore we could use
a faulting user access like get_user().

Darren has been working on patches to clean that up, some of those are
already merged in the -tip tree.

HTH

2009-03-27 10:56:53

by Minchan Kim

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Hi, Perter.
Thanks for joining this thread.

My concern is page reclaimer can reclaim the user page which have
futex between get_fuex_key and get_futex_value_locked.

On Fri, Mar 27, 2009 at 5:49 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2009-03-27 at 11:12 +0900, Minchan Kim wrote:
>> Hi, Peter and Eric.
>>
>> I am not expert about futex.
>> I am sorry if this is dumb question.
>>
>> If we use private futex, get_futex_key don't call get_user_pages_fast
>> which pins page at page table.
>
> But also drops that page ref at the end of get_futex_key(). The whole
> and only purpose of using get_user_pages_fast() is to get at the mapping
> data without having to obtain the mmap_sem.

Thanks.
I understand. It's not only private futex.

>
>> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
>> pagefault_disable.
>>
>> Who make sure the user page is mapped at app's page table ?
>
> Nobody, all uses of get_futex_value_locked() have to deal with it
> returning -EFAULT.

Does It mean that __copy_from_user_inatomic in get_futex_value_locked
would be failed rather than sleep?
In fact, I don't make sure _copy_from_user_inatomic function's meaning.
As far as I understand, It never sleep. It just can be failed in case
of user page isn't mapped. Is right ?
Otherwise, it can be scheduled with pagefault_disable which increments
preempt_count. It is a atomic bug.
If my assume is right, it can be failed rather than sleep.
At this case, other architecture implements __copy_from_user_inatomic
with __copy_from_user which can be scheduled. It also can be bug.

Hmm, Now I am confusing.

> Most of this is legacy btw, from when futex ops were done under the
> mmap_sem. Back then we couldn't fault because that would cause mmap_sem
> recursion. Howver, now that we don't hold mmap_sem anymore we could use
> a faulting user access like get_user().
> Darren has been working on patches to clean that up, some of those are
> already merged in the -tip tree.

Thanks for good information.
It will be very desirable way to enhance kernel performance.

> HTH
>



--
Kinds regards,
Minchan Kim

2009-03-27 11:15:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

On Fri, 2009-03-27 at 19:56 +0900, Minchan Kim wrote:

> >> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
> >> pagefault_disable.
> >>
> >> Who make sure the user page is mapped at app's page table ?
> >
> > Nobody, all uses of get_futex_value_locked() have to deal with it
> > returning -EFAULT.
>
> Does It mean that __copy_from_user_inatomic in get_futex_value_locked
> would be failed rather than sleep?

Correct.

> In fact, I don't make sure _copy_from_user_inatomic function's meaning.
> As far as I understand, It never sleep. It just can be failed in case
> of user page isn't mapped. Is right ?

Correct.

> Otherwise, it can be scheduled with pagefault_disable which increments
> preempt_count. It is a atomic bug.
> If my assume is right, it can be failed rather than sleep.
> At this case, other architecture implements __copy_from_user_inatomic
> with __copy_from_user which can be scheduled. It also can be bug.
>
> Hmm, Now I am confusing.

Confused I guess ;-)

The trick is in the in_atomic() check in the pagefault handler and the
fixup section of the copy routines.

#define __copy_user(to, from, size) \
do { \
int __d0, __d1, __d2; \
__asm__ __volatile__( \
" cmp $7,%0\n" \
" jbe 1f\n" \
" movl %1,%0\n" \
" negl %0\n" \
" andl $7,%0\n" \
" subl %0,%3\n" \
"4: rep; movsb\n" \
" movl %3,%0\n" \
" shrl $2,%0\n" \
" andl $3,%3\n" \
" .align 2,0x90\n" \
"0: rep; movsl\n" \
" movl %3,%0\n" \
"1: rep; movsb\n" \
"2:\n" \
".section .fixup,\"ax\"\n" \
"5: addl %3,%0\n" \
" jmp 2b\n" \
"3: lea 0(%3,%0,4),%0\n" \
" jmp 2b\n" \
".previous\n" \
".section __ex_table,\"a\"\n" \
" .align 4\n" \
" .long 4b,5b\n" \
" .long 0b,3b\n" \
" .long 1b,2b\n" \
".previous" \
: "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \
: "3"(size), "0"(size), "1"(to), "2"(from) \
: "memory"); \
} while (0)

see that __ex_table section, it tells the fault handler where to
continue in case of an atomic fault.

> > Most of this is legacy btw, from when futex ops were done under the
> > mmap_sem. Back then we couldn't fault because that would cause mmap_sem
> > recursion. Howver, now that we don't hold mmap_sem anymore we could use
> > a faulting user access like get_user().
> > Darren has been working on patches to clean that up, some of those are
> > already merged in the -tip tree.
>
> Thanks for good information.
> It will be very desirable way to enhance kernel performance.

I doubt it'll make a measurable difference, if you need to fault
performance sucks anyway. If you don't, the current code is just as
fast.

2009-03-27 11:37:50

by Minchan Kim

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

On Fri, Mar 27, 2009 at 8:14 PM, Peter Zijlstra <[email protected]> wrote:> On Fri, 2009-03-27 at 19:56 +0900, Minchan Kim wrote:>>> >> Then, get_futex_value_locked calls __cpy_from_user_inatomic with>> >> pagefault_disable.>> >>>> >> Who make sure the user page is mapped at app's page table ?>> >>> > Nobody, all uses of get_futex_value_locked() have to deal with it>> > returning -EFAULT.>>>> Does It mean that __copy_from_user_inatomic in get_futex_value_locked>> would be failed rather than sleep?>> Correct.>>> In fact, I don't make sure _copy_from_user_inatomic function's meaning.>> As far as I understand, It never sleep. It just can be failed in case>> of user page isn't mapped. Is right ?>> Correct.>>> Otherwise, it can be scheduled with pagefault_disable which increments>> preempt_count. It is a atomic bug.>> If my assume is right, it can be failed rather than sleep.>> At this case, other architecture implements __copy_from_user_inatomic>> with __copy_from_user which can be scheduled. It also can be bug.>>>> Hmm, Now I am confusing.>> Confused I guess ;-)> The trick is in the in_atomic() check in the pagefault handler and the> fixup section of the copy routines.
Whew~, There was good hidden trick.I will dive into this assembly.I always thanks for your kindness. :)
> #define __copy_user(to, from, size)                                     \> do {                                                                    \>        int __d0, __d1, __d2;                                           \>        __asm__ __volatile__(                                           \>                "       cmp  $7,%0\n"                                   \>                "       jbe  1f\n"                                      \>                "       movl %1,%0\n"                                   \>                "       negl %0\n"                                      \>                "       andl $7,%0\n"                                   \>                "       subl %0,%3\n"                                   \>                "4:     rep; movsb\n"                                   \>                "       movl %3,%0\n"                                   \>                "       shrl $2,%0\n"                                   \>                "       andl $3,%3\n"                                   \>                "       .align 2,0x90\n"                                \>                "0:     rep; movsl\n"                                   \>                "       movl %3,%0\n"                                   \>                "1:     rep; movsb\n"                                   \>                "2:\n"                                                  \>                ".section .fixup,\"ax\"\n"                              \>                "5:     addl %3,%0\n"                                   \>                "       jmp 2b\n"                                       \>                "3:     lea 0(%3,%0,4),%0\n"                            \>                "       jmp 2b\n"                                       \>                ".previous\n"                                           \>                ".section __ex_table,\"a\"\n"                           \>                "       .align 4\n"                                     \>                "       .long 4b,5b\n"                                  \>                "       .long 0b,3b\n"                                  \>                "       .long 1b,2b\n"                                  \>                ".previous"                                             \>                : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2)   \>                : "3"(size), "0"(size), "1"(to), "2"(from)              \>                : "memory");                                            \> } while (0)>> see that __ex_table section, it tells the fault handler where to> continue in case of an atomic fault.>>> > Most of this is legacy btw, from when futex ops were done under the>> > mmap_sem. Back then we couldn't fault because that would cause mmap_sem>> > recursion. Howver, now that we don't hold mmap_sem anymore we could use>> > a faulting user access like get_user().>> > Darren has been working on patches to clean that up, some of those are>> > already merged in the -tip tree.>>>> Thanks for good information.>> It will be very desirable way to enhance kernel performance.>> I doubt it'll make a measurable difference, if you need to fault> performance sucks anyway. If you don't, the current code is just as> fast.>


-- Kinds regards,Minchan Kim????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2009-03-27 15:43:32

by Darren Hart

[permalink] [raw]
Subject: Re: Question about PRIVATE_FUTEX

Minchan Kim wrote:
> On Fri, Mar 27, 2009 at 8:14 PM, Peter Zijlstra <[email protected]> wrote:
>> On Fri, 2009-03-27 at 19:56 +0900, Minchan Kim wrote:
>>
>>>>> Then, get_futex_value_locked calls __cpy_from_user_inatomic with
>>>>> pagefault_disable.
>>>>>
>>>>> Who make sure the user page is mapped at app's page table ?
>>>> Nobody, all uses of get_futex_value_locked() have to deal with it
>>>> returning -EFAULT.
>>> Does It mean that __copy_from_user_inatomic in get_futex_value_locked
>>> would be failed rather than sleep?
>> Correct.
>>
>>> In fact, I don't make sure _copy_from_user_inatomic function's meaning.
>>> As far as I understand, It never sleep. It just can be failed in case
>>> of user page isn't mapped. Is right ?
>> Correct.
>>
>>> Otherwise, it can be scheduled with pagefault_disable which increments
>>> preempt_count. It is a atomic bug.
>>> If my assume is right, it can be failed rather than sleep.
>>> At this case, other architecture implements __copy_from_user_inatomic
>>> with __copy_from_user which can be scheduled. It also can be bug.
>>>
>>> Hmm, Now I am confusing.
>> Confused I guess ;-)
>> The trick is in the in_atomic() check in the pagefault handler and the
>> fixup section of the copy routines.
>
> Whew~, There was good hidden trick.
> I will dive into this assembly.
> I always thanks for your kindness. :)
>
>> #define __copy_user(to, from, size) \
>> do { \
>> int __d0, __d1, __d2; \
>> __asm__ __volatile__( \
>> " cmp $7,%0\n" \
>> " jbe 1f\n" \
>> " movl %1,%0\n" \
>> " negl %0\n" \
>> " andl $7,%0\n" \
>> " subl %0,%3\n" \
>> "4: rep; movsb\n" \
>> " movl %3,%0\n" \
>> " shrl $2,%0\n" \
>> " andl $3,%3\n" \
>> " .align 2,0x90\n" \
>> "0: rep; movsl\n" \
>> " movl %3,%0\n" \
>> "1: rep; movsb\n" \
>> "2:\n" \
>> ".section .fixup,\"ax\"\n" \
>> "5: addl %3,%0\n" \
>> " jmp 2b\n" \
>> "3: lea 0(%3,%0,4),%0\n" \
>> " jmp 2b\n" \
>> ".previous\n" \
>> ".section __ex_table,\"a\"\n" \
>> " .align 4\n" \
>> " .long 4b,5b\n" \
>> " .long 0b,3b\n" \
>> " .long 1b,2b\n" \
>> ".previous" \
>> : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \
>> : "3"(size), "0"(size), "1"(to), "2"(from) \
>> : "memory"); \
>> } while (0)
>>
>> see that __ex_table section, it tells the fault handler where to
>> continue in case of an atomic fault.
>>
>>>> Most of this is legacy btw, from when futex ops were done under the
>>>> mmap_sem. Back then we couldn't fault because that would cause mmap_sem
>>>> recursion. Howver, now that we don't hold mmap_sem anymore we could use
>>>> a faulting user access like get_user().
>>>> Darren has been working on patches to clean that up, some of those are
>>>> already merged in the -tip tree.

I'm a little late to the party I guess. Minchan, a lot of the fault
logic has been cleaned up in the tip tree, core/futexes branch. The
removes a lot of the legacy complication from the faulting paths.
However, the get_futex_key code remains the same if I remember correctly.

>>> Thanks for good information.
>>> It will be very desirable way to enhance kernel performance.
>> I doubt it'll make a measurable difference, if you need to fault
>> performance sucks anyway. If you don't, the current code is just as
>> fast.
>>

Agreed. If you are suffering performance hits from excessive paging,
consider locking your memory.


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team