2009-01-17 21:56:17

by Oleg Nesterov

Subject: Re: + lock_page_killable-avoid-lost-wakeups.patch added to -mm tree

I think the patch is correct, just a question,

> int __lock_page_killable(struct page *page)
> {
> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> + int ret;
>
> - return __wait_on_bit_lock(page_waitqueue(page), &wait,
> + ret = __wait_on_bit_lock(page_waitqueue(page), &wait,
> sync_page_killable, TASK_KILLABLE);
> + /*
> + * wait_on_bit_lock uses prepare_to_wait_exclusive, so if multiple
> + * procs were waiting on this page, we were the only proc woken up.
> + *
> + * if ret != 0, we didn't actually get the lock. We need to
> + * make sure any other waiters don't sleep forever.
> + */
> + if (ret)
> + wake_up_page(page, PG_locked);

This patch assumes that nobody else calls __wait_on_bit_lock() with
action which can return !0. Currently this is correct, but perhaps
it makes sense to move this wake_up_page() into __wait_on_bit_lock ?

Note that we need to "transfer" the wakeup only if wake_up_page()
has already removed us from page_waitqueue(page), this means we
don't need to check ret != 0 twice in __wait_on_bit_lock(), afaics
we can do

	if ((ret = (*action)(q->key.flags))) {
		__wake_up_bit(wq, q->key.flags, q->key.bit_nr);
		// or just __wake_up(wq, TASK_NORMAL, 1, &q->key);
		break;
	}

IOW, imho __wait_on_bit_lock() is buggy, not __lock_page_killable(),
no?
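
For context, folding that suggestion into the whole function would look
roughly like this (an illustrative sketch against the 2.6.28-era
kernel/wait.c, not a posted patch):

	int __sched
	__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
			   int (*action)(void *), unsigned mode)
	{
		int ret = 0;

		do {
			prepare_to_wait_exclusive(wq, &q->wait, mode);
			if (test_bit(q->key.bit_nr, q->key.flags)) {
				if ((ret = (*action)(q->key.flags))) {
					/*
					 * We may have consumed an exclusive
					 * wakeup; pass it on before bailing.
					 */
					__wake_up_bit(wq, q->key.flags,
						      q->key.bit_nr);
					break;
				}
			}
		} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
		finish_wait(wq, &q->wait);
		return ret;
	}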

Oleg.


2009-01-18 01:39:18

by Johannes Weiner

Subject: [PATCH v3] wait: prevent waiter starvation in __wait_on_bit_lock

[added linux-mm to CC]

On Sat, Jan 17, 2009 at 10:51:10PM +0100, Oleg Nesterov wrote:
> I think the patch is correct, just a question,
>
> > int __lock_page_killable(struct page *page)
> > {
> > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> > + int ret;
> >
> > - return __wait_on_bit_lock(page_waitqueue(page), &wait,
> > + ret = __wait_on_bit_lock(page_waitqueue(page), &wait,
> > sync_page_killable, TASK_KILLABLE);
> > + /*
> > + * wait_on_bit_lock uses prepare_to_wait_exclusive, so if multiple
> > + * procs were waiting on this page, we were the only proc woken up.
> > + *
> > + * if ret != 0, we didn't actually get the lock. We need to
> > + * make sure any other waiters don't sleep forever.
> > + */
> > + if (ret)
> > + wake_up_page(page, PG_locked);
>
> This patch assumes that nobody else calls __wait_on_bit_lock() with
> action which can return !0. Currently this is correct, but perhaps
> it makes sense to move this wake_up_page() into __wait_on_bit_lock ?
>
> Note that we need to "transfer" the wakeup only if wake_up_page()
> has already removed us from page_waitqueue(page), this means we
> don't need to check ret != 0 twice in __wait_on_bit_lock(), afaics
> we can do
>
> if ((ret = (*action)(q->key.flags))) {
> __wake_up_bit(wq, q->key.flags, q->key.bit_nr);
> // or just __wake_up(wq, TASK_NORMAL, 1, &q->key);
> break;
> }
>
> IOW, imho __wait_on_bit_lock() is buggy, not __lock_page_killable(),
> no?

I agree with you, already replied with a patch to linux-mm where Chris
posted it originally.

Peter noted that we have a spurious wake up in the case where A holds
the page lock, B and C wait, B gets killed and does a wake up, then A
unlocks and does a wake up. Your proposal has this problem too,
right? For example when C is killed it will wake up B without reason.

I included an extra test_bit() to check if it's really up to us to
either lock or wake the next contender.
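
For reference, the lost wakeup the patch below fixes, as a timeline (an
illustrative sketch; A holds the page lock, B and C are exclusive
waiters):

	A: lock_page()                  holds PG_locked
	B: lock_page_killable()         queues exclusively, sleeps
	C: lock_page_killable()         queues exclusively, sleeps
	A: unlock_page()                clears the bit, wakes exactly
	                                one waiter: B
	B: action() sees the SIGKILL    returns -EINTR, takes no lock,
	                                wakes nobody
	C: sleeps forever on an unlocked page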

Hannes

---

__wait_on_bit_lock() employs exclusive waiters, which means that every
contender has to make sure to wake up the next one in the queue after
releasing the lock.

If the passed in action() returns a non-zero value, the lock is not
taken but the next waiter is not woken up either, leading to endless
waiting on an unlocked lock.

This has been observed with lock_page_killable() as a user which
passes an action function that can fail.

Fix it in __wait_on_bit_lock() by waking up the next contender if
necessary when we abort the acquisition.

Reported-by: Chris Mason <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/wait.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

v3: check ret only once per Oleg Nesterov and don't do unnecessary
wake ups per Peter Zijlstra

v2: v1 fixed something unrelated. duh.

--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -182,8 +182,20 @@ __wait_on_bit_lock(wait_queue_head_t *wq
 	do {
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
 		if (test_bit(q->key.bit_nr, q->key.flags)) {
-			if ((ret = (*action)(q->key.flags)))
+			ret = action(q->key.flags);
+			if (ret) {
+				/*
+				 * Contenders are woken exclusively. If
+				 * we do not take the lock when woken up
+				 * from an unlock, we have to make sure to
+				 * wake the next waiter in line or no one
+				 * will, and they will wait forever.
+				 */
+				if (!test_bit(q->key.bit_nr, q->key.flags))
+					__wake_up_bit(wq, q->key.flags,
+						      q->key.bit_nr);
 				break;
+			}
 		}
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
 	finish_wait(wq, &q->wait);

2009-01-18 02:37:16

by Oleg Nesterov

Subject: Re: [PATCH v3] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/18, Johannes Weiner wrote:
>
> On Sat, Jan 17, 2009 at 10:51:10PM +0100, Oleg Nesterov wrote:
> >
> > if ((ret = (*action)(q->key.flags))) {
> > __wake_up_bit(wq, q->key.flags, q->key.bit_nr);
> > // or just __wake_up(wq, TASK_NORMAL, 1, &q->key);
> > break;
> > }
> >
> > IOW, imho __wait_on_bit_lock() is buggy, not __lock_page_killable(),
> > no?
>
> I agree with you, already replied with a patch to linux-mm where Chris
> posted it originally.
>
> Peter noted that we have a spurious wake up in the case where A holds
> the page lock, B and C wait, B gets killed and does a wake up, then A
> unlocks and does a wake up. Your proposal has this problem too,
> right?

Yes sure. But I can't see how it is possible to avoid the false
wakeup for sure, please see below.

> @@ -182,8 +182,20 @@ __wait_on_bit_lock(wait_queue_head_t *wq
> do {
> prepare_to_wait_exclusive(wq, &q->wait, mode);
> if (test_bit(q->key.bit_nr, q->key.flags)) {
> - if ((ret = (*action)(q->key.flags)))
> + ret = action(q->key.flags);
> + if (ret) {
> + /*
> + * Contenders are woken exclusively. If
> + * we do not take the lock when woken up
> + * from an unlock, we have to make sure to
> + * wake the next waiter in line or no one
> + * will, and they will wait forever.
> + */
> + if (!test_bit(q->key.bit_nr, q->key.flags))
> + __wake_up_bit(wq, q->key.flags,

Afaics, the spurious wake up is still possible if SIGKILL and
unlock_page() happen "at the same time".

	__wait_on_bit_lock:			unlock_page:

						clear_bit_unlock()
	test_bit() == T

	__wake_up_bit()				wake_up_page()

Note that sync_page_killable() returns with ->state == TASK_RUNNING,
__wake_up() will "ignore" us.

But, more importantly, I'm afraid we can also have the false negative,
this "if (!test_bit())" test lacks the barriers. This can't happen with
sync_page_killable() because it always calls schedule(). But let's
suppose we modify it to check signal_pending() first:

	static int sync_page_killable(void *word)
	{
		if (fatal_signal_pending(current))
			return -EINTR;
		return sync_page(word);
	}

It is still correct, but unless I missed something now __wait_on_bit_lock()
has problems again.
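
Spelled out as an interleaving (an illustrative sketch): with the
modified sync_page_killable(), the killed waiter may return without
ever passing through schedule(), so nothing orders its read of the bit:

	waiter B (just killed):			lock holder A:

	action() returns -EINTR
	without calling schedule()
						clear_bit_unlock(PG_locked)
						wake_up_page() dequeues B
	if (!test_bit(...)) may still see
	a stale set bit -> B wakes nobody
	and the next contender starves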

But don't get me wrong, I think you are right and it is better to minimize
the possibility of the false wakeup.

Oleg.

2009-01-20 20:32:31

by Johannes Weiner

Subject: Re: [PATCH v3] wait: prevent waiter starvation in __wait_on_bit_lock

On Sun, Jan 18, 2009 at 03:32:11AM +0100, Oleg Nesterov wrote:
> On 01/18, Johannes Weiner wrote:
> >
> > On Sat, Jan 17, 2009 at 10:51:10PM +0100, Oleg Nesterov wrote:
> > >
> > > if ((ret = (*action)(q->key.flags))) {
> > > __wake_up_bit(wq, q->key.flags, q->key.bit_nr);
> > > // or just __wake_up(wq, TASK_NORMAL, 1, &q->key);
> > > break;
> > > }
> > >
> > > IOW, imho __wait_on_bit_lock() is buggy, not __lock_page_killable(),
> > > no?
> >
> > I agree with you, already replied with a patch to linux-mm where Chris
> > posted it originally.
> >
> > Peter noted that we have a spurious wake up in the case where A holds
> > the page lock, B and C wait, B gets killed and does a wake up, then A
> > unlocks and does a wake up. Your proposal has this problem too,
> > right?
>
> Yes sure. But I can't see how it is possible to avoid the false
> wakeup for sure, please see below.
>
> > @@ -182,8 +182,20 @@ __wait_on_bit_lock(wait_queue_head_t *wq
> > do {
> > prepare_to_wait_exclusive(wq, &q->wait, mode);
> > if (test_bit(q->key.bit_nr, q->key.flags)) {
> > - if ((ret = (*action)(q->key.flags)))
> > + ret = action(q->key.flags);
> > + if (ret) {
> > + /*
> > + * Contenders are woken exclusively. If
> > + * we do not take the lock when woken up
> > + * from an unlock, we have to make sure to
> > + * wake the next waiter in line or no one
> > + * will, and they will wait forever.
> > + */
> > + if (!test_bit(q->key.bit_nr, q->key.flags))
> > + __wake_up_bit(wq, q->key.flags,
>
> Afaics, the spurious wake up is still possible if SIGKILL and
> unlock_page() happen "at the same time".
>
> __wait_on_bit_lock: unlock_page:
>
> clear_bit_unlock()
> test_bit() == T
>
> __wake_up_bit() wake_up_page()
>
> Note that sync_page_killable() returns with ->state == TASK_RUNNING,
> __wake_up() will "ignore" us.

Yeah, we are not completely out of the woods. But it's getting
better. The code is still pretty clear, the major bug is fixed and we
even managed to get the race window for false wake ups pretty small.
Isn't that great? :)

> But, more importantly, I'm afraid we can also have the false negative,
> this "if (!test_bit())" test lacks the barriers. This can't happen with
> sync_page_killable() because it always calls schedule(). But let's
> suppose we modify it to check signal_pending() first:
>
> static int sync_page_killable(void *word)
> {
> if (fatal_signal_pending(current))
> return -EINTR;
> return sync_page(word);
> }
>
> It is still correct, but unless I missed something now __wait_on_bit_lock()
> has problems again.

Hm, this would require the lock bit to be set without someone else
doing the wakeup. How could this happen?

I could think of wake_up_page() happening BEFORE clear_bit_unlock()
and we have to be on the front of the waitqueue. Then we are already
running, the wake up is a nop, the !test_bit() is false and no one
wakes up the next real contender.

But the wake up side uses a smp barrier after clearing the bit, so if
the bit is not cleared we can expect a wake up, no?

Or do we still need a read-side barrier before the test bit? Damned
coherency...

> But don't get me wrong, I think you are right and it is better to minimize
> the possibility of the false wakeup.
>
> Oleg.

Thanks,
Hannes

2009-01-21 14:41:01

by Oleg Nesterov

Subject: Re: [PATCH v3] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/20, Johannes Weiner wrote:
>
> > But, more importantly, I'm afraid we can also have the false negative,
> > this "if (!test_bit())" test lacks the barriers. This can't happen with
> > sync_page_killable() because it always calls schedule(). But let's
> > suppose we modify it to check signal_pending() first:
> >
> > static int sync_page_killable(void *word)
> > {
> > if (fatal_signal_pending(current))
> > return -EINTR;
> > return sync_page(word);
> > }
> >
> > It is still correct, but unless I missed something now __wait_on_bit_lock()
> > has problems again.
>
> Hm, this would require the lock bit to be set without someone else
> doing the wakeup. How could this happen?
>
> I could think of wake_up_page() happening BEFORE clear_bit_unlock()
> and we have to be on the front of the waitqueue. Then we are already
> running, the wake up is a nop, the !test_bit() is false and no one
> wakes up the next real contender.
>
> But the wake up side uses a smp barrier after clearing the bit, so if
> the bit is not cleared we can expect a wake up, no?

Yes we have the barriers on the "wakeup", but this doesn't mean the
woken task must see the result of clear_bit() (unless it was really
unscheduled of course).

> Or do we still need a read-side barrier before the test bit?

Even this can't help afaics.

Because the whole clear_bit + wakeup sequence can happen after
the "if (!test_bit())" check and before finish_wait(). Please note
that from the waker's POV we are sleeping in TASK_KILLABLE state;
it will wake us up if we are at the front of the waitqueue.

(to clarify, I am talking about the imaginary sync_page_killable()
above).
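
As an illustrative sketch of that interleaving (B's action() failed
with ret != 0; from A's point of view, B is still queued in
TASK_KILLABLE):

	waiter B (ret != 0):			lock holder A:

	if (!test_bit(...)) -> bit still
	set, no wakeup is passed on
						clear_bit_unlock()
						wake_up_page() -> "wakes" B,
						dequeueing it (autoremove)
	finish_wait() -> already dequeued
	return -EINTR -> the wakeup was
	consumed, the next waiter starves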

Oleg.

2009-01-21 21:39:25

by Johannes Weiner

Subject: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On Wed, Jan 21, 2009 at 03:36:02PM +0100, Oleg Nesterov wrote:
> On 01/20, Johannes Weiner wrote:
> >
> > > But, more importantly, I'm afraid we can also have the false negative,
> > > this "if (!test_bit())" test lacks the barriers. This can't happen with
> > > sync_page_killable() because it always calls schedule(). But let's
> > > suppose we modify it to check signal_pending() first:
> > >
> > > static int sync_page_killable(void *word)
> > > {
> > > if (fatal_signal_pending(current))
> > > return -EINTR;
> > > return sync_page(word);
> > > }
> > >
> > > It is still correct, but unless I missed something now __wait_on_bit_lock()
> > > has problems again.
> >
> > Hm, this would require the lock bit to be set without someone else
> > doing the wakeup. How could this happen?
> >
> > I could think of wake_up_page() happening BEFORE clear_bit_unlock()
> > and we have to be on the front of the waitqueue. Then we are already
> > running, the wake up is a nop, the !test_bit() is false and no one
> > wakes up the next real contender.
> >
> > But the wake up side uses a smp barrier after clearing the bit, so if
> > the bit is not cleared we can expect a wake up, no?
>
> Yes we have the barriers on the "wakeup", but this doesn't mean the
> woken task must see the result of clear_bit() (unless it was really
> unscheduled of course).
>
> > Or do we still need a read-side barrier before the test bit?
>
> Even this can't help afaics.
>
> Because the whole clear_bit + wakeup sequence can happen after
> the "if (!test_bit())" check and before finish_wait(). Please note
> that from the waker's POV we are sleeping in TASK_KILLABLE state;
> it will wake us up if we are at the front of the waitqueue.

Yes, you are right. Thanks for the explanation!

I redid the patch and I *think* it is safe now and the race window for
a false wake up is pretty small. But we check @ret twice... Oh well.

If the wake up happens now before the test_bit(), the added reader
barrier ensures we see the bit clear and do the wake up as needed. If
it happens afterwards, we are definitely off the queue (finish_wait()
has removed us and it also comes with a barrier in the form of a spin
lock/unlock).

And only in the second scenario there is a small chance of a bogus
wake up, when the next contender hasn't set the bit again yet.

Does this all make sense? :)
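
The two orderings, spelled out (an illustrative sketch of the
reasoning above):

	Case 1: the unlock's wakeup happens before our test_bit()

		A: clear_bit_unlock(); wake_up_page()  -> dequeues us
		B: finish_wait(); smp_rmb()
		B: !test_bit() sees the cleared bit    -> wakeup passed on

	Case 2: the unlock's wakeup happens after finish_wait()

		B: finish_wait()  -> we dequeue ourselves under wq->lock
		A: clear_bit_unlock(); wake_up_page()  -> wakes the next
		   waiter directly; our extra __wake_up_bit() is at worst
		   a bogus but harmless wakeup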

---
__wait_on_bit_lock() employs exclusive waiters, which means that every
contender has to make sure to wake up the next one in the queue after
releasing the lock.

The current implementation does not do this for failed acquisitions.
If the passed in action() returns a non-zero value, the lock is not
taken but the next waiter is not woken up either, leading to endless
waiting on an unlocked lock.

This failure mode was observed with lock_page_killable() as a user
which passes an action function that can fail and thereby prevent lock
acquisition.

Fix it in __wait_on_bit_lock() by passing on to the next contender
when we were woken by an unlock but won't take the lock ourselves.

Reported-by: Chris Mason <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---

diff --git a/kernel/wait.c b/kernel/wait.c
index cd87131..e3aaa81 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -187,6 +187,31 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		}
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
 	finish_wait(wq, &q->wait);
+	if (unlikely(ret)) {
+		/*
+		 * Contenders are woken exclusively. If we were woken
+		 * by an unlock we have to take the lock ourselves and
+		 * wake the next contender on unlock. But the waiting
+		 * function failed, we do not take the lock and won't
+		 * unlock in the future. Make sure the next contender
+		 * does not wait forever on an unlocked bit.
+		 *
+		 * We can also get here without being woken through
+		 * the waitqueue, so there is a small chance of doing a
+		 * bogus wake up between an unlock clearing the bit and
+		 * the next contender being woken up and setting it again.
+		 *
+		 * It does no harm, though, the scheduler will ignore it
+		 * as the process in question is already running.
+		 *
+		 * The unlock path clears the bit and then wakes up the
+		 * next contender. If the next contender is us, the
+		 * barrier makes sure we also see the bit cleared.
+		 */
+		smp_rmb();
+		if (!test_bit(q->key.bit_nr, q->key.flags))
+			__wake_up_bit(wq, q->key.flags, q->key.bit_nr);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);

2009-01-22 20:31:20

by Oleg Nesterov

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/21, Johannes Weiner wrote:
>
> @@ -187,6 +187,31 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> }
> } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> finish_wait(wq, &q->wait);
> + if (unlikely(ret)) {
> + /*
> + * Contenders are woken exclusively. If we were woken
> + * by an unlock we have to take the lock ourselves and
> + * wake the next contender on unlock. But the waiting
> + * function failed, we do not take the lock and won't
> + * unlock in the future. Make sure the next contender
> + * does not wait forever on an unlocked bit.
> + *
> + * We can also get here without being woken through
> + * the waitqueue, so there is a small chance of doing a
> + * bogus wake up between an unlock clearing the bit and
> + * the next contender being woken up and setting it again.
> + *
> + * It does no harm, though, the scheduler will ignore it
> + * as the process in question is already running.
> + *
> + * The unlock path clears the bit and then wakes up the
> + * next contender. If the next contender is us, the
> + * barrier makes sure we also see the bit cleared.
> + */
> + smp_rmb();
> + if (!test_bit(q->key.bit_nr, q->key.flags))
> + __wake_up_bit(wq, q->key.flags, q->key.bit_nr);

I think this is correct, and (unfortunately ;) you are right:
we need rmb() even after finish_wait().

And we have to check ret twice, and the false wakeup is still
possible. This is minor, but just for discussion, can't we do
this differently?

	int finish_wait_xxx(wait_queue_head_t *q, wait_queue_t *wait)
	{
		unsigned long flags;
		int woken;

		__set_current_state(TASK_RUNNING);
		spin_lock_irqsave(&q->lock, flags);
		woken = list_empty(&wait->task_list);
		list_del_init(&wait->task_list);
		spin_unlock_irqrestore(&q->lock, flags);

		return woken;
	}

Now, __wait_on_bit_lock() does:

	if (test_bit(q->key.bit_nr, q->key.flags)) {
		if ((ret = (*action)(q->key.flags))) {
			if (finish_wait_xxx(...))
				__wake_up_bit(...);
			return ret;
		}
	}

Or we can introduce

	int finish_wait_yyy(wait_queue_head_t *q, wait_queue_t *wait,
			    int mode, void *key)
	{
		unsigned long flags;
		int woken;

		__set_current_state(TASK_RUNNING);
		spin_lock_irqsave(&q->lock, flags);
		woken = list_empty(&wait->task_list);
		if (woken)
			__wake_up_common(q, mode, 1, key);
		else
			list_del_init(&wait->task_list);
		spin_unlock_irqrestore(&q->lock, flags);

		return woken;
	}

Perhaps a bit too much for this particular case, but I am thinking
about other cases when we need to abort the exclusive wait.

For example, don't we have the similar problems with
wait_event_interruptible_exclusive() ?
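
For reference, the 2.6.28-era macro in question (abridged from
include/linux/wait.h): an interrupted exclusive waiter just does
finish_wait() and leaves, which is the same consumed-wakeup problem as
in __wait_on_bit_lock():

	#define __wait_event_interruptible_exclusive(wq, condition, ret)	\
	do {									\
		DEFINE_WAIT(__wait);						\
										\
		for (;;) {							\
			prepare_to_wait_exclusive(&wq, &__wait,		\
						  TASK_INTERRUPTIBLE);		\
			if (condition)						\
				break;						\
			if (!signal_pending(current)) {			\
				schedule();					\
				continue;					\
			}							\
			ret = -ERESTARTSYS;					\
			break;						\
		}								\
		finish_wait(&wq, &__wait);					\
	} while (0)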

Oleg.

2009-01-23 00:26:40

by Dmitry Adamushko

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

2009/1/22 Oleg Nesterov <[email protected]>:
> On 01/21, Johannes Weiner wrote:
>>
>> @@ -187,6 +187,31 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
>> }
>> } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
>> finish_wait(wq, &q->wait);
>> + if (unlikely(ret)) {
>> + /*
>> + * Contenders are woken exclusively. If we were woken
>> + * by an unlock we have to take the lock ourselves and
>> + * wake the next contender on unlock. But the waiting
>> + * function failed, we do not take the lock and won't
>> + * unlock in the future. Make sure the next contender
>> + * does not wait forever on an unlocked bit.
>> + *
>> + * We can also get here without being woken through
>> + * the waitqueue, so there is a small chance of doing a
>> + * bogus wake up between an unlock clearing the bit and
>> + * the next contender being woken up and setting it again.
>> + *
>> + * It does no harm, though, the scheduler will ignore it
>> + * as the process in question is already running.
>> + *
>> + * The unlock path clears the bit and then wakes up the
>> + * next contender. If the next contender is us, the
>> + * barrier makes sure we also see the bit cleared.
>> + */
>> + smp_rmb();
>> + if (!test_bit(q->key.bit_nr, q->key.flags))
>> + __wake_up_bit(wq, q->key.flags, q->key.bit_nr);
>
> I think this is correct, and (unfortunately ;) you are right:
> we need rmb() even after finish_wait().

Hum, I think it's actually not necessary in this particular case when
(1) "the next contender is us" and (2) we are in the "ret != 0" path
so that the only thing we really care about -- if we were exclusively
woken up, then wake up somebody else [*].

"the next contender is us" implies that we were still on the 'wq'
queue when __wake_up_bit() -> __wake_up() has been called, meaning
that wq->lock has also been taken (in __wake_up()).

Now, on our side, we are definitely on the 'wq' queue before calling
finish_wait(), meaning that we also take the wq->lock.

In short, wq->lock is a sync. mechanism in this case. The scheme is as follows:

our side:

[ finish_wait() ]

lock(wq->lock);
delete us from the 'wq'
unlock(wq->lock);

test_bit() [ read a bit ]


waker's side:

clear_bit()
smp_mb__after_clear_bit() --- is a must to ensure that we fetch the
'wq' (and do a waitqueue_active(wq) check) in __wake_up_bit() _only_
after clearing the bit.

[ __wake_up_bit(); -> __wake_up() ] --> we are on the 'wq' (see conditions [*])
lock(wq->lock);
wake 'us' up here
unlock(wq->lock);


Now the point is, without smp_rmb() on the side of wait_on_bit(),
test_bit() [ which is a LOAD op ] can get _into_ the wq->lock section,
something like this:

[ finish_wait() ]

lock(wq->lock);
test_bit() [ read a bit ]
delete us from the 'wq'
unlock(wq->lock);

If (1) is true (we were woken up indeed), it means that __wake_up()
(from __wake_up_bit()) has been executed before we were able to enter
finish_wait().

By the moment __wake_up_bit() was executed (we were woken up), the bit
was already cleared -- that's guaranteed by a full MB on the
wake_up_bit side (in our case [*] wq->lock would do it even without
the MB) -> meaning that we don't miss !test_bit() in this particular
case [*].

p.s. if the explanation is vague or heh even wrong, it's definitely
due to the lack of sleep ;-))

>
> For example, don't we have the similar problems with
> wait_event_interruptible_exclusive() ?

yes, I think so.


>
> Oleg.
>

--
Best regards,
Dmitry Adamushko

2009-01-23 00:51:58

by Oleg Nesterov

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/23, Dmitry Adamushko wrote:
>
> 2009/1/22 Oleg Nesterov <[email protected]>:
> >
> > I think this is correct, and (unfortunately ;) you are right:
> > we need rmb() even after finish_wait().
>
> Hum, I think it's actually not necessary in this particular case when
> (1) "the next contender is us" and (2) we are in the "ret != 0" path
> so that the only thing we really care about -- if we were exclusively
> woken up, then wake up somebody else [*].
>
> "the next contender is us" implies that we were still on the 'wq'
> queue when __wake_up_bit() -> __wake_up() has been called, meaning
> that wq->lock has also been taken (in __wake_up()).
>
> Now, on our side, we are definitely on the 'wq' queue before calling
> finish_wait(), meaning that we also take the wq->lock.
>
> In short, wq->lock is a sync. mechanism in this case. The scheme is as follows:
>
> our side:
>
> [ finish_wait() ]
>
> lock(wq->lock);

But we can skip lock(wq->lock), afaics.

Without rmb(), test_bit() can be re-ordered with list_empty_careful()
in finish_wait() and even with __set_task_state(TASK_RUNNING).
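
For reference, the 2.6.28-era finish_wait() (abridged from
kernel/wait.c): in the common case where the waker has already dequeued
us, it takes no lock at all, so nothing orders a later test_bit()
against the list_empty_careful() read:

	void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);
		/*
		 * Lockless check: if the waker has already removed us,
		 * we never touch q->lock here, hence no barrier.
		 */
		if (!list_empty_careful(&wait->task_list)) {
			spin_lock_irqsave(&q->lock, flags);
			list_del_init(&wait->task_list);
			spin_unlock_irqrestore(&q->lock, flags);
		}
	}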

> p.s. if the explanation is vague or heh even wrong, it's definitely
> due to the lack of sleep ;-))

The same on my side ;)

Oleg.

2009-01-23 10:00:27

by Johannes Weiner

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On Thu, Jan 22, 2009 at 09:25:50PM +0100, Oleg Nesterov wrote:
> On 01/21, Johannes Weiner wrote:
> >
> > @@ -187,6 +187,31 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > }
> > } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > finish_wait(wq, &q->wait);
> > + if (unlikely(ret)) {
> > + /*
> > + * Contenders are woken exclusively. If we were woken
> > + * by an unlock we have to take the lock ourselves and
> > + * wake the next contender on unlock. But the waiting
> > + * function failed, we do not take the lock and won't
> > + * unlock in the future. Make sure the next contender
> > + * does not wait forever on an unlocked bit.
> > + *
> > + * We can also get here without being woken through
> > + * the waitqueue, so there is a small chance of doing a
> > + * bogus wake up between an unlock clearing the bit and
> > + * the next contender being woken up and setting it again.
> > + *
> > + * It does no harm, though, the scheduler will ignore it
> > + * as the process in question is already running.
> > + *
> > + * The unlock path clears the bit and then wakes up the
> > + * next contender. If the next contender is us, the
> > + * barrier makes sure we also see the bit cleared.
> > + */
> > + smp_rmb();
> > + if (!test_bit(q->key.bit_nr, q->key.flags))
> > + __wake_up_bit(wq, q->key.flags, q->key.bit_nr);
>
> I think this is correct, and (unfortunately ;) you are right:
> we need rmb() even after finish_wait().
>
> And we have to check ret twice, and the false wakeup is still
> possible. This is minor, but just for discussion, can't we do
> this differently?
>
> int finish_wait_xxx(wait_queue_head_t *q, wait_queue_t *wait)
> {
> unsigned long flags;
> int woken;
>
> __set_current_state(TASK_RUNNING);
> spin_lock_irqsave(&q->lock, flags);
> woken = list_empty(&wait->task_list);
> list_del_init(&wait->task_list);
> spin_unlock_irqrestore(&q->lock, flags);
>
> return woken;
> }

Hehe, there is only n solutions to this problem. I had thought about
that too, even written it down. But I was not sure if taking the
spinlock, toggling irqs and (re)storing the flags is better than an
untaken branch. ;)

> Now, __wait_on_bit_lock() does:
>
> if (test_bit(q->key.bit_nr, q->key.flags)) {
> if ((ret = (*action)(q->key.flags))) {
> if (finish_wait_xxx(...))
> __wake_up_bit(...);
> return ret;
> }
> }

If you don't mind putting a second finish_wait() in there (you still
need the one after the loop, right?), we can fix up my version to not
check ret twice but do finish_wait() as you describe and then the
test_bit() && wake up:

	do {
		if (test_bit()) {
			if ((ret = action())) {
				finish_wait()
				smp_rmb()
				if (!test_bit())
					__wake_up_bit()
				return ret
			}
		}
	} while (test_and_set_bit())
	finish_wait()
	return 0

> Or we can introduce
>
> int finish_wait_yyy(wait_queue_head_t *q, wait_queue_t *wait,
> int mode, void *key)
> {
> unsigned long flags;
> int woken;
>
> __set_current_state(TASK_RUNNING);
> spin_lock_irqsave(&q->lock, flags);
> woken = list_empty(&wait->task_list);
> if (woken)
> __wake_up_common(q, mode, 1, key);
> else
> list_del_init(&wait->task_list);
> spin_unlock_irqrestore(&q->lock, flags);
>
> return woken;
> }
>
> Perhaps a bit too much for this particular case, but I am thinking
> about other cases when we need to abort the exclusive wait.
>
> For example, don't we have the similar problems with
> wait_event_interruptible_exclusive() ?

Yeah, we do IIUC. Then having finish_wait() extended is probably a
good idea.

> Oleg.

Hannes

2009-01-23 10:08:00

by Dmitry Adamushko

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

2009/1/23 Oleg Nesterov <[email protected]>:
> On 01/23, Dmitry Adamushko wrote:
>>
>> 2009/1/22 Oleg Nesterov <[email protected]>:
>> >
>> > I think this is correct, and (unfortunately ;) you are right:
>> > we need rmb() even after finish_wait().
>>
>> Hum, I think it's actually not necessary in this particular case when
>> (1) "the next contender is us" and (2) we are in the "ret != 0" path
>> so that the only thing we really care about -- if we were exclusively
>> woken up, then wake up somebody else [*].
>>
>> "the next contender is us" implies that we were still on the 'wq'
>> queue when __wake_up_bit() -> __wake_up() has been called, meaning
>> that wq->lock has also been taken (in __wake_up()).
>>
>> Now, on our side, we are definitely on the 'wq' queue before calling
>> finish_wait(), meaning that we also take the wq->lock.
>>
>> In short, wq->lock is a sync. mechanism in this case. The scheme is as follows:
>>
>> our side:
>>
>> [ finish_wait() ]
>>
>> lock(wq->lock);
>
> But we can skip lock(wq->lock), afaics.
>
> Without rmb(), test_bit() can be re-ordered with list_empty_careful()
> in finish_wait() and even with __set_task_state(TASK_RUNNING).

But taking into account the constraints of this special case, namely
(1), we can't skip lock(wq->lock).

(1) "the next contender is us"

In this particular situation, we are only interested in the case when
we were woken up by __wake_up_bit().

that means we are _on_ the 'wq' list when we do finish_wait() -> we do
take the 'wq->lock'.

Moreover, imagine the following case (roughly similar to finish_wait()):

	if (LOAD(a) == 1) {
		// do something here
		mb();
	}

	LOAD(b);

Can LOAD(b) be reordered with LOAD(a)?

I'd imagine that it can be done by CPUs that execute both branch paths
in advance, but then LOAD(b) must be re-loaded if (a == 1) and we hit
mb().


>
> Oleg.
>

--
Best regards,
Dmitry Adamushko

2009-01-23 11:08:17

by Oleg Nesterov

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/23, Dmitry Adamushko wrote:
>
> 2009/1/23 Oleg Nesterov <[email protected]>:
> > On 01/23, Dmitry Adamushko wrote:
> >>
> >> In short, wq->lock is a sync. mechanism in this case. The scheme is as follows:
> >>
> >> our side:
> >>
> >> [ finish_wait() ]
> >>
> >> lock(wq->lock);
> >
> > But we can skip lock(wq->lock), afaics.
> >
> > Without rmb(), test_bit() can be re-ordered with list_empty_careful()
> > in finish_wait() and even with __set_task_state(TASK_RUNNING).
>
> But taking into account the constraints of this special case, namely
> (1), we can't skip lock(wq->lock).
>
> (1) "the next contender is us"
>
> In this particular situation, we are only interested in the case when
> we were woken up by __wake_up_bit().

Yes,

> that means we are _on_ the 'wq' list when we do finish_wait() -> we do
> take the 'wq->lock'.

Hmm. No?

We are doing exclusive wait, and we use autoremove_wake_function().
If we were woken, we are removed from ->task_list.

> Moreover, imagine the following case (roughly similar to finish_wait()):
>
> if (LOAD(a) == 1) {
> // do something here
> mb();
> }
>
> LOAD(b);
>
> Can LOAD(b) be reordered with LOAD(a)?

Well, I think yes, it can. But I'd suggest you ask somebody else ;)

So, without rmb() I think it is theoretically possible that we read
test_bit() before we get list_empty_careful() == T.

Oleg.

2009-01-23 11:38:27

by Oleg Nesterov

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/23, Johannes Weiner wrote:
>
> On Thu, Jan 22, 2009 at 09:25:50PM +0100, Oleg Nesterov wrote:
> > On 01/21, Johannes Weiner wrote:
> > >
> > int finish_wait_xxx(wait_queue_head_t *q, wait_queue_t *wait)
> > {
> > unsigned long flags;
> > int woken;
> >
> > __set_current_state(TASK_RUNNING);
> > spin_lock_irqsave(&q->lock, flags);
> > woken = list_empty(&wait->task_list);
> > list_del_init(&wait->task_list);
> > spin_unlock_irqrestore(&q->lock, flags);
> >
> > return woken;
> > }
>
> Hehe, there is only n solutions to this problem. I had thought about
> that too, even written it down. But I was not sure if taking the
> spinlock, toggling irqs and (re)storing the flags is better than an
> untaken branch. ;)

Yes. Fortunately, this is "unlikely" path.

> > if (test_bit(q->key.bit_nr, q->key.flags)) {
> > if ((ret = (*action)(q->key.flags))) {
> > if (finish_wait_xxx(...))
> > __wake_up_bit(...);
> > return ret;
> > }
> > }
>
> If you don't mind putting a second finish_wait() in there (you still
> need the one after the loop, right?), we can fix up my version to not
> check ret twice but do finish_wait() as you describe and then the
> test_bit() && wake up:
>
> do {
> if (test_bit())
> if ((ret = action())) {
> finish_wait()
> smp_rmb()
> if (!test_bit())
> __wake_up_bit()

Yes sure. Except this wakeup can be false.

> > int finish_wait_yyy(wait_queue_head_t *q, wait_queue_t *wait,
> > int mode, void *key)
> > {
> > unsigned long flags;
> > int woken;
> >
> > __set_current_state(TASK_RUNNING);
> > spin_lock_irqsave(&q->lock, flags);
> > woken = list_empty(&wait->task_list);
> > if (woken)
> > __wake_up_common(q, mode, 1, key);
> > else
> > list_del_init(&wait->task_list);
> > spin_unlock_irqrestore(&q->lock, flags);
> >
> > return woken;
> > }
> >
> > Perhaps a bit too much for this particular case, but I am thinking
> > about other cases when we need to abort the exclusive wait.
> >
> > For example, don't we have the similar problems with
> > wait_event_interruptible_exclusive() ?
>
> Yeah, we do IIUC. Then having finish_wait() extended is probably a
> good idea.

Yes.

It is not that I think this new helper is really needed for this
particular case, personally I agree with the patch you sent.

But if we have other places with the similar problem, then perhaps
it is better to introduce the special finish_wait_exclusive() or
whatever.

Oleg.

2009-01-23 12:36:44

by Dmitry Adamushko

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

2009/1/23 Oleg Nesterov <[email protected]>:
> On 01/23, Dmitry Adamushko wrote:
>>
>> 2009/1/23 Oleg Nesterov <[email protected]>:
>> > On 01/23, Dmitry Adamushko wrote:
>> >>
>> >> In short, wq->lock is a sync. mechanism in this case. The scheme is as follows:
>> >>
>> >> our side:
>> >>
>> >> [ finish_wait() ]
>> >>
>> >> lock(wq->lock);
>> >
>> > But we can skip lock(wq->lock), afaics.
>> >
>> > Without rmb(), test_bit() can be re-ordered with list_empty_careful()
>> > in finish_wait() and even with __set_task_state(TASK_RUNNING).
>>
>> But taking into account the constraints of this special case, namely
>> (1), we can't skip lock(wq->lock).
>>
>> (1) "the next contender is us"
>>
>> In this particular situation, we are only interested in the case when
>> we were woken up by __wake_up_bit().
>
> Yes,
>
>> that means we are _on_ the 'wq' list when we do finish_wait() -> we do
>> take the 'wq->lock'.
>
> Hmm. No?
>
> We are doing exclusive wait, and we use autoremove_wake_function().
> If we were woken, we are removed from ->task_list.

Argh, right, somehow I've made wrong assumptions on the wake-up part :-/


>
> Oleg.
>

--
Best regards,
Dmitry Adamushko

2009-01-23 13:33:37

by Oleg Nesterov

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On 01/23, Oleg Nesterov wrote:
>
> It is not that I think this new helper is really needed for this
> particular case, personally I agree with the patch you sent.
>
> But if we have other places with the similar problem, then perhaps
> it is better to introduce the special finish_wait_exclusive() or
> whatever.

To clarify, I suggest something like this.

	int finish_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait,
				  int ret, int state, void *key)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);

		if (ret || !list_empty_careful(&wait->task_list)) {
			spin_lock_irqsave(&q->lock, flags);
			if (list_empty(&wait->task_list))
				__wake_up_common(q, state, 1, key);
			else
				list_del_init(&wait->task_list);
			spin_unlock_irqrestore(&q->lock, flags);
		}

		return ret;
	}

Now, __wait_on_bit_lock() becomes:

	int __sched
	__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
			   int (*action)(void *), unsigned mode)
	{
		int ret = 0;

		do {
			prepare_to_wait_exclusive(wq, &q->wait, mode);
			if (test_bit(q->key.bit_nr, q->key.flags) &&
			    (ret = (*action)(q->key.flags)))
				break;
		} while (test_and_set_bit(q->key.bit_nr, q->key.flags));

		return finish_wait_exclusive(wq, &q->wait, ret, mode, &q->key);
	}

And __wait_event_interruptible_exclusive:

	#define __wait_event_interruptible_exclusive(wq, condition, ret)	\
	do {									\
		DEFINE_WAIT(__wait);						\
										\
		for (;;) {							\
			prepare_to_wait_exclusive(&wq, &__wait,		\
						  TASK_INTERRUPTIBLE);		\
			if (condition)						\
				break;						\
			if (!signal_pending(current)) {			\
				schedule();					\
				continue;					\
			}							\
			ret = -ERESTARTSYS;					\
			break;						\
		}								\
		finish_wait_exclusive(&wq, &__wait,				\
				      ret, TASK_INTERRUPTIBLE, NULL);		\
	} while (0)

But I can't convince myself this is what we really want. So I am not
sending the patch. And yes, we have to check ret twice.

Oleg.

2009-01-23 19:26:11

by Johannes Weiner

Subject: Re: [RFC v4] wait: prevent waiter starvation in __wait_on_bit_lock

On Fri, Jan 23, 2009 at 12:35:41PM +0100, Oleg Nesterov wrote:
> On 01/23, Johannes Weiner wrote:
> >
> > On Thu, Jan 22, 2009 at 09:25:50PM +0100, Oleg Nesterov wrote:
> > > On 01/21, Johannes Weiner wrote:
> > > >
> > > int finish_wait_xxx(wait_queue_head_t *q, wait_queue_t *wait)
> > > {
> > > unsigned long flags;
> > > int woken;
> > >
> > > __set_current_state(TASK_RUNNING);
> > > spin_lock_irqsave(&q->lock, flags);
> > > woken = list_empty(&wait->task_list);
> > > list_del_init(&wait->task_list);
> > > spin_unlock_irqrestore(&q->lock, flags);
> > >
> > > return woken;
> > > }
> >
> > Hehe, there is only n solutions to this problem. I had thought about
> > that too, even written it down. But I was not sure if taking the
> > spinlock, toggling irqs and (re)storing the flags is better than an
> > untaken branch. ;)
>
> Yes. Fortunately, this is "unlikely" path.
>
> > > if (test_bit(q->key.bit_nr, q->key.flags)) {
> > > if ((ret = (*action)(q->key.flags))) {
> > > if (finish_wait_xxx(...))
> > > __wake_up_bit(...);
> > > return ret;
> > > }
> > > }
> >
> > If you don't mind putting a second finish_wait() in there (you still
> > need the one after the loop, right?), we can fix up my version to not
> > check ret twice but do finish_wait() as you describe and then the
> > test_bit() && wake up:
> >
> > do {
> > if (test_bit())
> > if ((ret = action())) {
> > finish_wait()
> > smp_rmb()
> > if (!test_bit())
> > __wake_up_bit()
>
> Yes sure. Except this wakeup can be false.
>
> > > int finish_wait_yyy(wait_queue_head_t *q, wait_queue_t *wait,
> > > int mode, void *key)
> > > {
> > > unsigned long flags;
> > > int woken;
> > >
> > > __set_current_state(TASK_RUNNING);
> > > spin_lock_irqsave(&q->lock, flags);
> > > woken = list_empty(&wait->task_list);
> > > if (woken)
> > > __wake_up_common(q, mode, 1, key);
> > > else
> > > list_del_init(&wait->task_list);
> > > spin_unlock_irqrestore(&q->lock, flags);
> > >
> > > return woken;
> > > }
> > >
> > > Perhaps a bit too much for this particular case, but I am thinking
> > > about other cases when we need to abort the exclusive wait.
> > >
> > > For example, don't we have the similar problems with
> > > wait_event_interruptible_exclusive() ?
> >
> > Yeah, we do IIUC. Then having finish_wait() extended is probably a
> > good idea.
>
> Yes.
>
> It is not that I think this new helper is really needed for this
> particular case, personally I agree with the patch you sent.
>
> But if we have other places with the similar problem, then perhaps
> it is better to introduce the special finish_wait_exclusive() or
> whatever.

Agreed. I will whip up another series that adds
finish_wait_exclusive() and adjusts the problematic callsites.

Hannes

2009-01-26 22:02:05

by Johannes Weiner

Subject: [RFC v5] wait: prevent exclusive waiter starvation

On Fri, Jan 23, 2009 at 02:30:50PM +0100, Oleg Nesterov wrote:
> On 01/23, Oleg Nesterov wrote:
> >
> > It is not that I think this new helper is really needed for this
> > particular case, personally I agree with the patch you sent.
> >
> > But if we have other places with the similar problem, then perhaps
> > it is better to introduce the special finish_wait_exclusive() or
> > whatever.
>
> To clarify, I suggest something like this.
>
> int finish_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait,
> int ret, int state, void *key)
> {
> unsigned long flags;
>
> __set_current_state(TASK_RUNNING);
>
> if (ret || !list_empty_careful(&wait->task_list)) {
> spin_lock_irqsave(&q->lock, flags);
> if (list_empty(&wait->task_list))
> __wake_up_common(q, state, 1, key);
> else
> list_del_init(&wait->task_list);
> spin_unlock_irqrestore(&q->lock, flags);
> }
>
> return ret;
> }
>
> Now, __wait_on_bit_lock() becomes:
>
> int __sched
> __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> int (*action)(void *), unsigned mode)
> {
> int ret = 0;
>
> do {
> prepare_to_wait_exclusive(wq, &q->wait, mode);
> if (test_bit(q->key.bit_nr, q->key.flags) &&
> (ret = (*action)(q->key.flags))
> break;
> } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
>
> return finish_wait_exclusive(wq, &q->wait, ret, mode, &q->key);
> }
>
> And __wait_event_interruptible_exclusive:
>
> #define __wait_event_interruptible_exclusive(wq, condition, ret) \
> do { \
> DEFINE_WAIT(__wait); \
> \
> for (;;) { \
> prepare_to_wait_exclusive(&wq, &__wait, \
> TASK_INTERRUPTIBLE); \
> if (condition) \
> break; \
> if (!signal_pending(current)) { \
> schedule(); \
> continue; \
> } \
> ret = -ERESTARTSYS; \
> break; \
> } \
> finish_wait_exclusive(&wq, &__wait, \
> ret, TASK_INTERRUPTIBLE, NULL); \
> } while (0)
>
> But I can't convince myself this is what we really want. So I am not
> sending the patch. And yes, we have to check ret twice.

Another iteration. I didn't use a general finish_wait_exclusive() but
a version of this function that just returns whether we were woken
through the queue or not. The result is stable due to the waitqueue
lock. The callsites use it only if interrupted and the normal
finish_wait() otherwise.

Hannes

---
With exclusive waiters, every process woken up through the wait queue
must ensure that the next waiter down the line is woken when it has
finished.

However, if the waiting processes sleep interruptibly, they might
abort waiting prematurely. And if this in turn races with an ongoing
wake up, the woken up process might be the one currently aborting the
wait and the next real contender is never woken up, doomed to stay on
the queue forever.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(). If the contender was killed and woken up at the
same time, the next contender would never be woken up: the previous
lock holder would wake the interrupted contender and the interrupted
contender would never do unlock_page() -> __wake_up_bit() because it
never took the lock in the first place.

To fix this, it must be ensured that when the interrupted task tries
to remove itself from the waitqueue (finish_wait) it has to check for
whether it was woken up through the queue, and if so, wake up the next
contender.

Add finish_wait_woken() which does the same as finish_wait() but also
returns whether the wait descriptor was already removed from the
queue. Serialized by the waitqueue spinlock, this is a safe indicator
of whether we have to wake up the next contender or whether the
previously running task will do it.

Then use this function for __wait_event_interruptible_exclusive() and
__wait_on_bit_lock() to wake up the next contender if needed.

Reported-by: Chris Mason <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/wait.h | 9 ++++++-
kernel/wait.c | 58 +++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index ef609f8..56c9402 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -333,16 +333,20 @@ do { \
 	for (;;) {							\
 		prepare_to_wait_exclusive(&wq, &__wait,		\
 					TASK_INTERRUPTIBLE);		\
-		if (condition)						\
+		if (condition) {					\
+			finish_wait(&wq, &__wait);			\
 			break;						\
+		}							\
 		if (!signal_pending(current)) {			\
 			schedule();					\
 			continue;					\
 		}							\
 		ret = -ERESTARTSYS;					\
+		if (finish_wait_woken(&wq, &__wait))			\
+			__wake_up_common(&wq, TASK_INTERRUPTIBLE,	\
+					 1, NULL);			\
 		break;						\
 	}								\
-	finish_wait(&wq, &__wait);					\
 } while (0)
 
 #define wait_event_interruptible_exclusive(wq, condition)		\
@@ -431,6 +435,7 @@ extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
 void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+int finish_wait_woken(wait_queue_head_t *q, wait_queue_t *wait);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);

diff --git a/kernel/wait.c b/kernel/wait.c
index cd87131..7fc0d57 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -91,6 +91,15 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
+/*
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
@@ -117,6 +126,32 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(finish_wait);
 
+/*
+ * finish_wait_woken - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ *
+ * Returns 1 if the waiting task was woken up through the
+ * wait descriptor in the queue, 0 otherwise.
+ */
+int finish_wait_woken(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	int woken;
+	unsigned long flags;
+
+	__set_current_state(TASK_RUNNING);
+	spin_lock_irqsave(&q->lock, flags);
+	if (!(woken = list_empty(&wait->task_list)))
+		list_del_init(&wait->task_list);
+	spin_unlock_irqrestore(&q->lock, flags);
+	return woken;
+}
+EXPORT_SYMBOL(finish_wait_woken);
+
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	int ret = default_wake_function(wait, mode, sync, key);
@@ -177,17 +212,26 @@ int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			int (*action)(void *), unsigned mode)
 {
-	int ret = 0;
-
 	do {
+		int ret;
+
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (test_bit(q->key.bit_nr, q->key.flags)) {
-			if ((ret = (*action)(q->key.flags)))
-				break;
-		}
+		if (!test_bit(q->key.bit_nr, q->key.flags))
+			continue;
+		if (!(ret = action(q->key.flags)))
+			continue;
+		/*
+		 * Exclusive waiting requires the woken up process
+		 * to ensure the next wake up. The lock acquisition
+		 * failed here, no unlock is expected. Make sure the
+		 * next process does not wait forever on the queue.
+		 */
+		if (finish_wait_woken(wq, &q->wait))
+			__wake_up_bit(wq, q->key.flags, q->key.bit_nr);
+		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
 	finish_wait(wq, &q->wait);
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);

--
1.6.0.3

2009-01-27 03:30:18

by Oleg Nesterov

Subject: Re: [RFC v5] wait: prevent exclusive waiter starvation

On 01/26, Johannes Weiner wrote:
>
> Another iteration. I didn't use a general finish_wait_exclusive() but
> a version of this function that just returns whether we were woken
> through the queue or not.

But if your helper (finish_wait_woken) returns true, we always need
to wakeup the next waiter, or we don't need to use it. So why not
place the wakeup in the helper itself?

> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -333,16 +333,20 @@ do { \
> for (;;) { \
> prepare_to_wait_exclusive(&wq, &__wait, \
> TASK_INTERRUPTIBLE); \
> - if (condition) \
> + if (condition) { \
> + finish_wait(&wq, &__wait); \
> break; \
> + } \
> if (!signal_pending(current)) { \
> schedule(); \
> continue; \
> } \
> ret = -ERESTARTSYS; \
> + if (finish_wait_woken(&wq, &__wait)) \
> + __wake_up_common(&wq, TASK_INTERRUPTIBLE, \

No, we can't use __wake_up_common() without wq->lock.
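
(For reference, __wake_up_common() walks the waiter list with no
locking of its own; the locked wrapper in the 2.6.28-era kernel/sched.c
looks roughly like this:

	void __wake_up(wait_queue_head_t *q, unsigned int mode,
		       int nr_exclusive, void *key)
	{
		unsigned long flags;

		spin_lock_irqsave(&q->lock, flags);
		__wake_up_common(q, mode, nr_exclusive, 0, key);
		spin_unlock_irqrestore(&q->lock, flags);
	}

so any caller of __wake_up_common() must already hold q->lock.)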

Oleg.

2009-01-27 19:36:42

by Johannes Weiner

Subject: [RFC v6] wait: prevent exclusive waiter starvation

On Tue, Jan 27, 2009 at 04:23:59AM +0100, Oleg Nesterov wrote:
> On 01/26, Johannes Weiner wrote:
> >
> > Another iteration. I didn't use a general finish_wait_exclusive() but
> > a version of this function that just returns whether we were woken
> > through the queue or not.
>
> But if your helper (finish_wait_woken) returns true, we always need
> to wakeup the next waiter, or we don't need to use it. So why not
> place the wakeup in the helper itself?

Good point.

> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -333,16 +333,20 @@ do { \
> > for (;;) { \
> > prepare_to_wait_exclusive(&wq, &__wait, \
> > TASK_INTERRUPTIBLE); \
> > - if (condition) \
> > + if (condition) { \
> > + finish_wait(&wq, &__wait); \
> > break; \
> > + } \
> > if (!signal_pending(current)) { \
> > schedule(); \
> > continue; \
> > } \
> > ret = -ERESTARTSYS; \
> > + if (finish_wait_woken(&wq, &__wait)) \
> > + __wake_up_common(&wq, TASK_INTERRUPTIBLE, \
>
> No, we can't use __wake_up_common() without wq->lock.

Whoops. Should have noticed that, sorry.

Okay, number six. I renamed the helper to abort_exclusive_wait().
It does the wake up with __wake_up_locked() under the waitqueue lock.

I also hope that the changelog is now, unlike the previous one,
intelligible.

Hannes

---
With exclusive waiters, every process woken up through the wait queue
must ensure that the next waiter down the line is woken when it has
finished.

Interruptible waiters don't do that when aborting due to a signal.
And if an aborting waiter is concurrently woken up through the
waitqueue, no one will ever wake up the next waiter.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(): the first contender on the queue was aborting
when the actual lock holder woke it up concurrently. The aborted
contender didn't acquire the lock and therefore never did an unlock
followed by waking up the next waiter.

Add abort_exclusive_wait() which removes the process' wait descriptor
from the waitqueue, iff still queued, or wakes up the next waiter
otherwise. It does so under the waitqueue lock. Racing with a wake
up means the aborting process is either already woken (removed from
the queue) and will wake up the next waiter, or it will remove itself
from the queue and the concurrent wake up will apply to the next
waiter after it.
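
Either ordering is safe because both sides run under q->lock; as an
illustrative sketch:

	Case 1: the waker goes first

		waker: dequeues us and wakes us up	(under q->lock)
		us:    see ourselves already dequeued	(under q->lock)
		us:    wake the next waiter in line	(same lock hold)

	Case 2: we go first

		us:    list_del_init() removes us	(under q->lock)
		waker: wakes whoever is now at the head of the queue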

Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
and __wait_on_bit_lock() when they were interrupted by other means
than a wake up through the queue.

Reported-by: Chris Mason <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Mentored-by: Oleg Nesterov <[email protected]>
---
include/linux/wait.h | 7 ++++-
kernel/wait.c | 57 +++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index ef609f8..57bfced 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -333,16 +333,18 @@ do { \
 	for (;;) {							\
 		prepare_to_wait_exclusive(&wq, &__wait,		\
 					TASK_INTERRUPTIBLE);		\
-		if (condition)						\
+		if (condition) {					\
+			finish_wait(&wq, &__wait);			\
 			break;						\
+		}							\
 		if (!signal_pending(current)) {			\
 			schedule();					\
 			continue;					\
 		}							\
 		ret = -ERESTARTSYS;					\
+		abort_exclusive_wait(&wq, &__wait);			\
 		break;						\
 	}								\
-	finish_wait(&wq, &__wait);					\
 } while (0)
 
 #define wait_event_interruptible_exclusive(wq, condition)		\
@@ -431,6 +433,7 @@ extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
 void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);

diff --git a/kernel/wait.c b/kernel/wait.c
index cd87131..21f88c4 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -91,6 +91,15 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
+/*
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
@@ -117,6 +126,38 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(finish_wait);
 
+/*
+ * abort_exclusive_wait - abort exclusive waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ *
+ * Wakes up the next waiter if the caller is concurrently
+ * woken up through the queue.
+ */
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	__set_current_state(TASK_RUNNING);
+	spin_lock_irqsave(&q->lock, flags);
+	if (list_empty(&wait->task_list))
+		list_del_init(&wait->task_list);
+	/*
+	 * If we were woken through the waitqueue (waker removed
+	 * us from the list) we must ensure the next waiter down
+	 * the line is woken up. The callsite will not do it as
+	 * it didn't finish waiting successfully.
+	 */
+	else if (waitqueue_active(q))
+		__wake_up_locked(q, TASK_INTERRUPTIBLE);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(abort_exclusive_wait);
+
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	int ret = default_wake_function(wait, mode, sync, key);
@@ -177,17 +218,19 @@ int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			int (*action)(void *), unsigned mode)
 {
-	int ret = 0;
-
 	do {
+		int ret;
+
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (test_bit(q->key.bit_nr, q->key.flags)) {
-			if ((ret = (*action)(q->key.flags)))
-				break;
-		}
+		if (!test_bit(q->key.bit_nr, q->key.flags))
+			continue;
+		if (!(ret = action(q->key.flags)))
+			continue;
+		abort_exclusive_wait(wq, &q->wait);
+		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
 	finish_wait(wq, &q->wait);
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);

--
1.6.0.3

2009-01-27 20:11:35

by Oleg Nesterov

Subject: Re: [RFC v6] wait: prevent exclusive waiter starvation

On 01/27, Johannes Weiner wrote:
>
> +void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> + unsigned long flags;
> +
> + __set_current_state(TASK_RUNNING);
> + spin_lock_irqsave(&q->lock, flags);
> + if (list_empty(&wait->task_list))

Hmm... it should be !list_empty() ?

> + list_del_init(&wait->task_list);
> + /*
> + * If we were woken through the waitqueue (waker removed
> + * us from the list) we must ensure the next waiter down
> + * the line is woken up. The callsite will not do it as
> + * it didn't finish waiting successfully.
> + */
> + else if (waitqueue_active(q))
> + __wake_up_locked(q, TASK_INTERRUPTIBLE);
> + spin_unlock_irqrestore(&q->lock, flags);
> +}

Well, personally I don't care, but this is against CodingStyle rules ;)

> int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
> {
> int ret = default_wake_function(wait, mode, sync, key);
> @@ -177,17 +218,19 @@ int __sched
> __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> int (*action)(void *), unsigned mode)
> {
> - int ret = 0;
> -
> do {
> + int ret;
> +
> prepare_to_wait_exclusive(wq, &q->wait, mode);
> - if (test_bit(q->key.bit_nr, q->key.flags)) {
> - if ((ret = (*action)(q->key.flags)))
> - break;
> - }
> + if (!test_bit(q->key.bit_nr, q->key.flags))
> + continue;
> + if (!(ret = action(q->key.flags)))
> + continue;
> + abort_exclusive_wait(wq, &q->wait);

No, no. We should use the same key in abort_exclusive_wait().
Otherwise, how can we wake up the next waiter that needs this
bit in the same page->flags?

That is why I suggested finish_wait_exclusive(..., void *key),
where the key is then passed on to __wake_up_common().
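
To recall why the key matters: wake_bit_function() only wakes a waiter
whose wait-bit key matches the one passed in. Roughly, from kernel/wait.c
of this era (comments mine):

	int wake_bit_function(wait_queue_t *wait, unsigned mode,
			      int sync, void *arg)
	{
		struct wait_bit_key *key = arg;
		struct wait_bit_queue *wait_bit
			= container_of(wait, struct wait_bit_queue, wait);

		/* only wake a waiter sleeping on the same bit in the
		 * same word, and only while that bit is clear */
		if (wait_bit->key.flags != key->flags ||
				wait_bit->key.bit_nr != key->bit_nr ||
				test_bit(key->bit_nr, key->flags))
			return 0;
		return autoremove_wake_function(wait, mode, sync, key);
	}

So a wakeup issued without the right key never reaches the waiter
sleeping on this bit.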

Oleg.

2009-01-27 22:33:24

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC v6] wait: prevent exclusive waiter starvation

On Tue, Jan 27, 2009 at 09:05:44PM +0100, Oleg Nesterov wrote:
> On 01/27, Johannes Weiner wrote:
> >
> > +void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > + unsigned long flags;
> > +
> > + __set_current_state(TASK_RUNNING);
> > + spin_lock_irqsave(&q->lock, flags);
> > + if (list_empty(&wait->task_list))
>
> Hmm... it should be !list_empty() ?

Yes.

>
> > + list_del_init(&wait->task_list);
> > + /*
> > + * If we were woken through the waitqueue (waker removed
> > + * us from the list) we must ensure the next waiter down
> > + * the line is woken up. The callsite will not do it as
> > + * it didn't finish waiting successfully.
> > + */
> > + else if (waitqueue_active(q))
> > + __wake_up_locked(q, TASK_INTERRUPTIBLE);
> > + spin_unlock_irqrestore(&q->lock, flags);
> > +}
>
> Well, personally I don't care, but this is against CodingStyle rules ;)

I removed the comment from there and added a note to the kerneldoc instead.

> > int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
> > {
> > int ret = default_wake_function(wait, mode, sync, key);
> > @@ -177,17 +218,19 @@ int __sched
> > __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > int (*action)(void *), unsigned mode)
> > {
> > - int ret = 0;
> > -
> > do {
> > + int ret;
> > +
> > prepare_to_wait_exclusive(wq, &q->wait, mode);
> > - if (test_bit(q->key.bit_nr, q->key.flags)) {
> > - if ((ret = (*action)(q->key.flags)))
> > - break;
> > - }
> > + if (!test_bit(q->key.bit_nr, q->key.flags))
> > + continue;
> > + if (!(ret = action(q->key.flags)))
> > + continue;
> > + abort_exclusive_wait(wq, &q->wait);
>
> No, no. We should use the same key in abort_exclusive_wait().
> Otherwise, how can we wake up the next waiter that needs this
> bit in the same page->flags?
>
> That is why I suggested finish_wait_exclusive(..., void *key),
> where the key is then passed on to __wake_up_common().

Okay, I am obviously wasting our time now. I have definitely stared so
long at the same three lines that I'm sending randomly broken patches,
so v7 is coming after some delay, including sleep.

Thanks for your patience,

hannes

2009-01-28 09:16:18

by Johannes Weiner

[permalink] [raw]
Subject: [RFC v7] wait: prevent exclusive waiter starvation

With exclusive waiters, every process woken up through the wait queue
must ensure that the next waiter down the line is woken when it has
finished.

Interruptible waiters don't do that when aborting due to a signal.
And if an aborting waiter is concurrently woken up through the
waitqueue, no one will ever wake up the next waiter.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(): the first contender on the queue was aborting
when the actual lock holder woke it up concurrently. The aborted
contender didn't acquire the lock and therefore never did an unlock
followed by waking up the next waiter.

Add abort_exclusive_wait() which removes the process' wait descriptor
from the waitqueue, iff still queued, or wakes up the next waiter
otherwise. It does so under the waitqueue lock. Racing with a wake
up means the aborting process is either already woken (removed from
the queue) and will wake up the next waiter, or it will remove itself
from the queue and the concurrent wake up will apply to the next
waiter after it.

Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
and __wait_on_bit_lock() when they were interrupted by other means
than a wake up through the queue.

Reported-by: Chris Mason <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Mentored-by: Oleg Nesterov <[email protected]>
---
include/linux/wait.h | 11 +++++++-
kernel/sched.c | 4 +-
kernel/wait.c | 58 +++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index ef609f8..a210ede 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -132,6 +132,8 @@ static inline void __remove_wait_queue(wait_queue_head_t *head,
list_del(&old->task_list);
}

+void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, int sync, void *key);
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
extern void __wake_up_locked(wait_queue_head_t *q, unsigned int mode);
extern void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -333,16 +335,19 @@ do { \
for (;;) { \
prepare_to_wait_exclusive(&wq, &__wait, \
TASK_INTERRUPTIBLE); \
- if (condition) \
+ if (condition) { \
+ finish_wait(&wq, &__wait); \
break; \
+ } \
if (!signal_pending(current)) { \
schedule(); \
continue; \
} \
ret = -ERESTARTSYS; \
+ abort_exclusive_wait(&wq, &__wait, \
+ TASK_INTERRUPTIBLE, NULL); \
break; \
} \
- finish_wait(&wq, &__wait); \
} while (0)

#define wait_event_interruptible_exclusive(wq, condition) \
@@ -431,6 +436,8 @@ extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
+ unsigned int mode, void *key);
int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);

diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..eccc2a3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4687,8 +4687,8 @@ EXPORT_SYMBOL(default_wake_function);
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
- int nr_exclusive, int sync, void *key)
+void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, int sync, void *key)
{
wait_queue_t *curr, *next;

diff --git a/kernel/wait.c b/kernel/wait.c
index cd87131..57d10f0 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -91,6 +91,15 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);

+/*
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
{
unsigned long flags;
@@ -117,6 +126,39 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
}
EXPORT_SYMBOL(finish_wait);

+/*
+ * abort_exclusive_wait - abort exclusive waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ * @mode: runstate of the waiter to be woken
+ * @key: key to identify a wait bit queue or %NULL
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ *
+ * Wakes up the next waiter if the caller is concurrently
+ * woken up through the queue.
+ *
+ * This prevents waiter starvation where an exclusive waiter
+ * aborts and is woken up concurrently and no one wakes up
+ * the next waiter.
+ */
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
+ unsigned int mode, void *key)
+{
+ unsigned long flags;
+
+ __set_current_state(TASK_RUNNING);
+ spin_lock_irqsave(&q->lock, flags);
+ if (!list_empty(&wait->task_list))
+ list_del_init(&wait->task_list);
+ else if (waitqueue_active(q))
+ __wake_up_common(q, mode, 1, 0, key);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(abort_exclusive_wait);
+
int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
int ret = default_wake_function(wait, mode, sync, key);
@@ -177,17 +219,19 @@ int __sched
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
int (*action)(void *), unsigned mode)
{
- int ret = 0;
-
do {
+ int ret;
+
prepare_to_wait_exclusive(wq, &q->wait, mode);
- if (test_bit(q->key.bit_nr, q->key.flags)) {
- if ((ret = (*action)(q->key.flags)))
- break;
- }
+ if (!test_bit(q->key.bit_nr, q->key.flags))
+ continue;
+ if (!(ret = action(q->key.flags)))
+ continue;
+ abort_exclusive_wait(wq, &q->wait, mode, &q->key);
+ return ret;
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
finish_wait(wq, &q->wait);
- return ret;
+ return 0;
}
EXPORT_SYMBOL(__wait_on_bit_lock);

--
1.6.0.3
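
With this in place, __lock_page_killable() can keep its original simple
form; the mm-side wake_up_page() workaround from Chris's patch, quoted
further below, becomes unnecessary. A sketch of the resulting caller:

	int __lock_page_killable(struct page *page)
	{
		DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

		/* on a signal, __wait_on_bit_lock() now hands the
		 * wakeup on to the next waiter by itself */
		return __wait_on_bit_lock(page_waitqueue(page), &wait,
					sync_page_killable, TASK_KILLABLE);
	}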

2009-01-29 04:47:37

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC v7] wait: prevent exclusive waiter starvation

On 01/28, Johannes Weiner wrote:
>
> Add abort_exclusive_wait() which removes the process' wait descriptor
> from the waitqueue, iff still queued, or wakes up the next waiter
> otherwise. It does so under the waitqueue lock. Racing with a wake
> up means the aborting process is either already woken (removed from
> the queue) and will wake up the next waiter, or it will remove itself
> from the queue and the concurrent wake up will apply to the next
> waiter after it.
>
> Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
> and __wait_on_bit_lock() when they were interrupted by other means
> than a wake up through the queue.

Imho, this all is right, and this patch should replace
lock_page_killable-avoid-lost-wakeups.patch (except for stable tree).

But I guess we need maintainer's opinion, we have them in cc ;)

Oleg.

2009-01-29 07:38:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC v7] wait: prevent exclusive waiter starvation

On Thu, 29 Jan 2009 05:42:27 +0100 Oleg Nesterov <[email protected]> wrote:

> On 01/28, Johannes Weiner wrote:
> >
> > Add abort_exclusive_wait() which removes the process' wait descriptor
> > from the waitqueue, iff still queued, or wakes up the next waiter
> > otherwise. It does so under the waitqueue lock. Racing with a wake
> > up means the aborting process is either already woken (removed from
> > the queue) and will wake up the next waiter, or it will remove itself
> > from the queue and the concurrent wake up will apply to the next
> > waiter after it.
> >
> > Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
> > and __wait_on_bit_lock() when they were interrupted by other means
> > than a wake up through the queue.
>
> Imho, this all is right, and this patch should replace
> lock_page_killable-avoid-lost-wakeups.patch (except for stable tree).

I dropped lock_page_killable-avoid-lost-wakeups.patch a while ago.

So I think we're saying that
lock_page_killable-avoid-lost-wakeups.patch actually did fix the bug?

And that "[RFC v7] wait: prevent exclusive waiter starvation" fixes it
as well, and in a preferable manner, but not a manner which we consider
suitable for -stable? (why?)

And hence that lock_page_killable-avoid-lost-wakeups.patch is the
appropriate fix for -stable?

If so, that's a bit unusual, and the -stable maintainers may choose to
take the patch which we're putting into 2.6.29.

Here, for the record (and for [email protected]), is
lock_page_killable-avoid-lost-wakeups.patch:


From: Chris Mason <[email protected]>

lock_page and lock_page_killable both call __wait_on_bit_lock, and both
end up using prepare_to_wait_exclusive(). This means that when someone
does finally unlock the page, only one process is going to get woken up.

But lock_page_killable can exit without taking the lock. If nobody else
comes in and locks the page, any other waiters will wait forever.

For example, procA holding the page lock, procB and procC are waiting on
the lock.

procA: lock_page() // success
procB: lock_page_killable(), sync_page_killable(), io_schedule()
procC: lock_page_killable(), sync_page_killable(), io_schedule()

procA: unlock, wake_up_page(page, PG_locked)
procA: wake up procB

happy admin: kill procB

procB: wakes into sync_page_killable(), notices the signal and returns
-EINTR

procB: __wait_on_bit_lock sees the action() func returns < 0 and does
not take the page lock

procB: lock_page_killable() returns < 0 and exits happily.

procC: sleeping in io_schedule() forever unless someone else locks the
page.

This was seen in production on systems where the database was shutting
down. Testing shows the patch fixes things.

Chuck Lever did all the hard work here, with a page lock debugging
patch that proved we were missing a wakeup.

Every kernel version that has lock_page_killable() should need this.

Signed-off-by: Chris Mason <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/filemap.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff -puN mm/filemap.c~lock_page_killable-avoid-lost-wakeups mm/filemap.c
--- a/mm/filemap.c~lock_page_killable-avoid-lost-wakeups
+++ a/mm/filemap.c
@@ -623,9 +623,20 @@ EXPORT_SYMBOL(__lock_page);
int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+ int ret;

- return __wait_on_bit_lock(page_waitqueue(page), &wait,
+ ret = __wait_on_bit_lock(page_waitqueue(page), &wait,
sync_page_killable, TASK_KILLABLE);
+ /*
+ * wait_on_bit_lock uses prepare_to_wait_exclusive, so if multiple
+ * procs were waiting on this page, we were the only proc woken up.
+ *
+ * if ret != 0, we didn't actually get the lock. We need to
+ * make sure any other waiters don't sleep forever.
+ */
+ if (ret)
+ wake_up_page(page, PG_locked);
+ return ret;
}

/**
_

2009-01-29 08:36:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC v7] wait: prevent exclusive waiter starvation

On 01/28, Andrew Morton wrote:
>
> On Thu, 29 Jan 2009 05:42:27 +0100 Oleg Nesterov <[email protected]> wrote:
>
> > On 01/28, Johannes Weiner wrote:
> > >
> > > Add abort_exclusive_wait() which removes the process' wait descriptor
> > > from the waitqueue, iff still queued, or wakes up the next waiter
> > > otherwise. It does so under the waitqueue lock. Racing with a wake
> > > up means the aborting process is either already woken (removed from
> > > the queue) and will wake up the next waiter, or it will remove itself
> > > from the queue and the concurrent wake up will apply to the next
> > > waiter after it.
> > >
> > > Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
> > > and __wait_on_bit_lock() when they were interrupted by other means
> > > than a wake up through the queue.
> >
> > Imho, this all is right, and this patch should replace
> > lock_page_killable-avoid-lost-wakeups.patch (except for stable tree).
>
> I dropped lock_page_killable-avoid-lost-wakeups.patch a while ago.
>
> So I think we're saying that
> lock_page_killable-avoid-lost-wakeups.patch actually did fix the bug?

I think yes,

> And that "[RFC v7] wait: prevent exclusive waiter starvation" fixes it
> as well, and in a preferable manner, but not a manner which we consider
> suitable for -stable? (why?)

I meant that lock_page_killable-avoid-lost-wakeups.patch is much simpler,
and thus it looks more "safe" for -stable.

But it is not optimal, and Johannes's patch is also more generic:
it fixes wait_event_interruptible_exclusive() as well.
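
For illustration only, a hypothetical caller of that path (all names
made up):

	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static int done;

	static int wait_for_done(void)
	{
		/* exclusive + interruptible: the combination that
		 * could lose a wakeup when a signal aborts the wait */
		return wait_event_interruptible_exclusive(wq, done != 0);
	}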

> And hence that lock_page_killable-avoid-lost-wakeups.patch is the
> appropriate fix for -stable?
>
> If so, that's a bit unusual, and the -stable maintainers may choose to
> take the patch which we're putting into 2.6.29.

Well, I don't know ;)

But Johannes's patch looks good to me; if you already dropped the old
patch, then I think this one can go into -stable after some testing.
Hopefully the maintainers can review it.

Oleg.

2009-01-29 09:14:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC v7] wait: prevent exclusive waiter starvation

On Thu, 29 Jan 2009 09:31:08 +0100 Oleg Nesterov <[email protected]> wrote:

> On 01/28, Andrew Morton wrote:
> >
> > On Thu, 29 Jan 2009 05:42:27 +0100 Oleg Nesterov <[email protected]> wrote:
> >
> > > On 01/28, Johannes Weiner wrote:
> > > >
> > > > Add abort_exclusive_wait() which removes the process' wait descriptor
> > > > from the waitqueue, iff still queued, or wakes up the next waiter
> > > > otherwise. It does so under the waitqueue lock. Racing with a wake
> > > > up means the aborting process is either already woken (removed from
> > > > the queue) and will wake up the next waiter, or it will remove itself
> > > > from the queue and the concurrent wake up will apply to the next
> > > > waiter after it.
> > > >
> > > > Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
> > > > and __wait_on_bit_lock() when they were interrupted by other means
> > > > than a wake up through the queue.
> > >
> > > Imho, this all is right, and this patch should replace
> > > lock_page_killable-avoid-lost-wakeups.patch (except for stable tree).
> >
> > I dropped lock_page_killable-avoid-lost-wakeups.patch a while ago.
> >
> > So I think we're saying that
> > lock_page_killable-avoid-lost-wakeups.patch actually did fix the bug?
>
> I think yes,
>
> > And that "[RFC v7] wait: prevent exclusive waiter starvation" fixes it
> > as well, and in a preferable manner, but not a manner which we consider
> > suitable for -stable? (why?)
>
> I meant that lock_page_killable-avoid-lost-wakeups.patch is much simpler,
> and thus it looks more "safe" for -stable.
>
> But it is not optimal, and Johannes's patch is also more generic:
> it fixes wait_event_interruptible_exclusive() as well.
>
> > And hence that lock_page_killable-avoid-lost-wakeups.patch is the
> > appropriate fix for -stable?
> >
> > If so, that's a bit unusual, and the -stable maintainers may choose to
> > take the patch which we're putting into 2.6.29.
>
> Well, I don't know ;)
>
> But Johannes's patch looks good to me; if you already dropped the old
> patch, then I think this one can go into -stable after some testing.
> Hopefully the maintainers can review it.
>

OK, thanks. That's why we pay the stable guys the big bucks.

I tagged the patch with

Cc: <[email protected]> ["after some testing"]

2009-01-29 14:38:27

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC v7] wait: prevent exclusive waiter starvation

On Thu, 2009-01-29 at 01:11 -0800, Andrew Morton wrote:
> On Thu, 29 Jan 2009 09:31:08 +0100 Oleg Nesterov <[email protected]> wrote:
>
> > On 01/28, Andrew Morton wrote:
> > >
> > > On Thu, 29 Jan 2009 05:42:27 +0100 Oleg Nesterov <[email protected]> wrote:
> > >
> > > > On 01/28, Johannes Weiner wrote:
> > > > >
> > > > > Add abort_exclusive_wait() which removes the process' wait descriptor
> > > > > from the waitqueue, iff still queued, or wakes up the next waiter
> > > > > otherwise. It does so under the waitqueue lock. Racing with a wake
> > > > > up means the aborting process is either already woken (removed from
> > > > > the queue) and will wake up the next waiter, or it will remove itself
> > > > > from the queue and the concurrent wake up will apply to the next
> > > > > waiter after it.
> > > > >
> > > > > Use abort_exclusive_wait() in __wait_event_interruptible_exclusive()
> > > > > and __wait_on_bit_lock() when they were interrupted by other means
> > > > > than a wake up through the queue.
> > > >
> > > > Imho, this all is right, and this patch should replace
> > > > lock_page_killable-avoid-lost-wakeups.patch (except for stable tree).
> > >
> > > I dropped lock_page_killable-avoid-lost-wakeups.patch a while ago.
> > >
> > > So I think we're saying that
> > > lock_page_killable-avoid-lost-wakeups.patch actually did fix the bug?
> >
> > I think yes,
> >

Our test case was able to reliably trigger the bug, and
lock_page_killable-avoid-lost-wakeups.patch fixed it.

I'll ask the testers to run v7 as well. The run takes about a day, so
confirmation will take a bit.

-chris