2016-10-07 15:25:42

by Peter Zijlstra

Subject: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.

The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop; it's
tempting to want to avoid it altogether, but the cancellation cases need
it to avoid losing wakeups.

Suggested-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/locking/mutex.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,

lock_contended(&lock->dep_map, ip);

+ set_task_state(task, state);
for (;;) {
+ /*
+ * Once we hold wait_lock, we're serialized against
+ * mutex_unlock() handing the lock off to us, do a trylock
+ * before testing the error conditions to make sure we pick up
+ * the handoff.
+ */
if (__mutex_trylock(lock, first))
- break;
+ goto acquired;

/*
- * got a signal? (This code gets eliminated in the
- * TASK_UNINTERRUPTIBLE case.)
+ * Check for signals and wound conditions while holding
+ * wait_lock. This ensures the lock cancellation is ordered
+ * against mutex_unlock() and wake-ups do not go missing.
*/
if (unlikely(signal_pending_state(state, task))) {
ret = -EINTR;
@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
goto err;
}

- __set_task_state(task, state);
spin_unlock_mutex(&lock->wait_lock, flags);
schedule_preempt_disabled();
- spin_lock_mutex(&lock->wait_lock, flags);

if (!first && __mutex_waiter_is_first(lock, &waiter)) {
first = true;
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
}
+
+ set_task_state(task, state);
+ /*
+ * Here we order against unlock; we must either see it change
+ * state back to RUNNING and fall through the next schedule(),
+ * or we must see its unlock and acquire.
+ */
+ if (__mutex_trylock(lock, first))
+ break;
+
+ spin_lock_mutex(&lock->wait_lock, flags);
}
+ spin_lock_mutex(&lock->wait_lock, flags);
+acquired:
__set_task_state(task, TASK_RUNNING);

mutex_remove_waiter(lock, &waiter, task);
@@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
return 0;

err:
+ __set_task_state(task, TASK_RUNNING);
mutex_remove_waiter(lock, &waiter, task);
spin_unlock_mutex(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
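
For reference, the restructured wait loop reads roughly as follows once the
patch is applied (reconstructed from the hunks above; the ww_mutex check and
the error/exit paths around it are elided):

        set_task_state(task, state);
        for (;;) {
                /*
                 * Once we hold wait_lock, we're serialized against
                 * mutex_unlock() handing the lock off to us, do a trylock
                 * before testing the error conditions to make sure we pick up
                 * the handoff.
                 */
                if (__mutex_trylock(lock, first))
                        goto acquired;

                /*
                 * Check for signals and wound conditions while holding
                 * wait_lock. This ensures the lock cancellation is ordered
                 * against mutex_unlock() and wake-ups do not go missing.
                 */
                if (unlikely(signal_pending_state(state, task))) {
                        ret = -EINTR;
                        goto err;
                }

                /* ... ww_mutex wound check elided ... */

                spin_unlock_mutex(&lock->wait_lock, flags);
                schedule_preempt_disabled();

                if (!first && __mutex_waiter_is_first(lock, &waiter)) {
                        first = true;
                        __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
                }

                set_task_state(task, state);
                /*
                 * Here we order against unlock; we must either see it change
                 * state back to RUNNING and fall through the next schedule(),
                 * or we must see its unlock and acquire.
                 */
                if (__mutex_trylock(lock, first))
                        break;

                spin_lock_mutex(&lock->wait_lock, flags);
        }
        spin_lock_mutex(&lock->wait_lock, flags);
acquired:
        __set_task_state(task, TASK_RUNNING);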



2016-10-13 15:17:35

by Will Deacon

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

Hi Peter,

I'm struggling to get my head around the handoff code after this change...

On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>
> lock_contended(&lock->dep_map, ip);
>
> + set_task_state(task, state);
> for (;;) {
> + /*
> + * Once we hold wait_lock, we're serialized against
> + * mutex_unlock() handing the lock off to us, do a trylock
> + * before testing the error conditions to make sure we pick up
> + * the handoff.
> + */
> if (__mutex_trylock(lock, first))
> - break;
> + goto acquired;
>
> /*
> - * got a signal? (This code gets eliminated in the
> - * TASK_UNINTERRUPTIBLE case.)
> + * Check for signals and wound conditions while holding
> + * wait_lock. This ensures the lock cancellation is ordered
> + * against mutex_unlock() and wake-ups do not go missing.
> */
> if (unlikely(signal_pending_state(state, task))) {
> ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> goto err;
> }
>
> - __set_task_state(task, state);
> spin_unlock_mutex(&lock->wait_lock, flags);
> schedule_preempt_disabled();
> - spin_lock_mutex(&lock->wait_lock, flags);
>
> if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> first = true;
> __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> }
> +
> + set_task_state(task, state);

With this change, we no longer hold the lock when we set the task
state, and it's ordered strictly *after* setting the HANDOFF flag.
Doesn't that mean that the unlock code can see the HANDOFF flag, issue
the wakeup, but then we come in and overwrite the task state?

I'm struggling to work out whether that's an issue, but it certainly
feels odd and is a change from the previous behaviour.

Will

2016-10-17 10:45:28

by Peter Zijlstra

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> Hi Peter,
>
> I'm struggling to get my head around the handoff code after this change...
>
> On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> > lock_contended(&lock->dep_map, ip);
> >
> > + set_task_state(task, state);
> > for (;;) {
> > + /*
> > + * Once we hold wait_lock, we're serialized against
> > + * mutex_unlock() handing the lock off to us, do a trylock
> > + * before testing the error conditions to make sure we pick up
> > + * the handoff.
> > + */
> > if (__mutex_trylock(lock, first))
> > - break;
> > + goto acquired;
> >
> > /*
> > - * got a signal? (This code gets eliminated in the
> > - * TASK_UNINTERRUPTIBLE case.)
> > + * Check for signals and wound conditions while holding
> > + * wait_lock. This ensures the lock cancellation is ordered
> > + * against mutex_unlock() and wake-ups do not go missing.
> > */
> > if (unlikely(signal_pending_state(state, task))) {
> > ret = -EINTR;
> > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > goto err;
> > }
> >
> > - __set_task_state(task, state);
> > spin_unlock_mutex(&lock->wait_lock, flags);
> > schedule_preempt_disabled();
> > - spin_lock_mutex(&lock->wait_lock, flags);
> >
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> > +
> > + set_task_state(task, state);
>
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
>
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

Right, so I think the code is fine, since in that case the
__mutex_trylock() must see the handoff and we'll break the loop and
(re)set the state to RUNNING.

But you're right in that it's slightly odd. I'll reorder them and put the
set_task_state() above the !first thing.
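
I.e., roughly this shape (just a sketch of the proposed reordering, not a
tested patch):

        spin_unlock_mutex(&lock->wait_lock, flags);
        schedule_preempt_disabled();

        set_task_state(task, state);

        if (!first && __mutex_waiter_is_first(lock, &waiter)) {
                first = true;
                __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
        }

        if (__mutex_trylock(lock, first))
                break;

        spin_lock_mutex(&lock->wait_lock, flags);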


2016-10-17 13:25:01

by Peter Zijlstra

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > Hi Peter,
> >
> > I'm struggling to get my head around the handoff code after this change...
> >
> > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > --- a/kernel/locking/mutex.c
> > > +++ b/kernel/locking/mutex.c
> > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > >
> > > lock_contended(&lock->dep_map, ip);
> > >
> > > + set_task_state(task, state);
> > > for (;;) {
> > > + /*
> > > + * Once we hold wait_lock, we're serialized against
> > > + * mutex_unlock() handing the lock off to us, do a trylock
> > > + * before testing the error conditions to make sure we pick up
> > > + * the handoff.
> > > + */
> > > if (__mutex_trylock(lock, first))
> > > - break;
> > > + goto acquired;
> > >
> > > /*
> > > - * got a signal? (This code gets eliminated in the
> > > - * TASK_UNINTERRUPTIBLE case.)
> > > + * Check for signals and wound conditions while holding
> > > + * wait_lock. This ensures the lock cancellation is ordered
> > > + * against mutex_unlock() and wake-ups do not go missing.
> > > */
> > > if (unlikely(signal_pending_state(state, task))) {
> > > ret = -EINTR;
> > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > > goto err;
> > > }
> > >
> > > - __set_task_state(task, state);
> > > spin_unlock_mutex(&lock->wait_lock, flags);
> > > schedule_preempt_disabled();
> > > - spin_lock_mutex(&lock->wait_lock, flags);
> > >
> > > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > > first = true;
> > > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > > }
> > > +
> > > + set_task_state(task, state);
> >
> > With this change, we no longer hold the lock when we set the task
> > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > the wakeup, but then we come in and overwrite the task state?
> >
> > I'm struggling to work out whether that's an issue, but it certainly
> > feels odd and is a change from the previous behaviour.
>
> Right, so I think the code is fine, since in that case the
> __mutex_trylock() must see the handoff and we'll break the loop and
> (re)set the state to RUNNING.
>
> But you're right in that it's slightly odd. I'll reorder them and put the
> set_task_state() above the !first thing.


Humm,.. we might actually rely on this order, since the MB implied by
set_task_state() is the only thing that separates the store of
__mutex_set_flag() from the load of __mutex_trylock(), and those should
be ordered I think.

Argh, completely messed up my brain. I'll not touch it and think on this
again tomorrow.

2016-10-17 13:45:04

by Boqun Feng

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Mon, Oct 17, 2016 at 03:24:08PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > > Hi Peter,
> > >
> > > I'm struggling to get my head around the handoff code after this change...
> > >
> > > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > > --- a/kernel/locking/mutex.c
> > > > +++ b/kernel/locking/mutex.c
> > > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > > >
> > > > lock_contended(&lock->dep_map, ip);
> > > >
> > > > + set_task_state(task, state);
> > > > for (;;) {
> > > > + /*
> > > > + * Once we hold wait_lock, we're serialized against
> > > > + * mutex_unlock() handing the lock off to us, do a trylock
> > > > + * before testing the error conditions to make sure we pick up
> > > > + * the handoff.
> > > > + */
> > > > if (__mutex_trylock(lock, first))
> > > > - break;
> > > > + goto acquired;
> > > >
> > > > /*
> > > > - * got a signal? (This code gets eliminated in the
> > > > - * TASK_UNINTERRUPTIBLE case.)
> > > > + * Check for signals and wound conditions while holding
> > > > + * wait_lock. This ensures the lock cancellation is ordered
> > > > + * against mutex_unlock() and wake-ups do not go missing.
> > > > */
> > > > if (unlikely(signal_pending_state(state, task))) {
> > > > ret = -EINTR;
> > > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > > > goto err;
> > > > }
> > > >
> > > > - __set_task_state(task, state);
> > > > spin_unlock_mutex(&lock->wait_lock, flags);
> > > > schedule_preempt_disabled();
> > > > - spin_lock_mutex(&lock->wait_lock, flags);
> > > >
> > > > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > > > first = true;
> > > > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > > > }
> > > > +
> > > > + set_task_state(task, state);
> > >
> > > With this change, we no longer hold the lock when we set the task
> > > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > > the wakeup, but then we come in and overwrite the task state?
> > >
> > > I'm struggling to work out whether that's an issue, but it certainly
> > > feels odd and is a change from the previous behaviour.
> >
> > Right, so I think the code is fine, since in that case the
> > __mutex_trylock() must see the handoff and we'll break the loop and
> > (re)set the state to RUNNING.
> >
> > But you're right in that it's slightly odd. I'll reorder them and put the
> > set_task_state() above the !first thing.
>
>
> Humm,.. we might actually rely on this order, since the MB implied by
> set_task_state() is the only thing that separates the store of
> __mutex_set_flag() from the load of __mutex_trylock(), and those should
> be ordered I think.
>

But __mutex_set_flag() and __mutex_trylock() actually touch the same
atomic word? So we don't need extra things to order them?

Regards,
Boqun

> Argh, completely messed up my brain. I'll not touch it and think on this
> again tomorrow.



2016-10-17 15:50:16

by Peter Zijlstra

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Mon, Oct 17, 2016 at 09:45:01PM +0800, Boqun Feng wrote:
> But __mutex_set_flag() and __mutex_trylock() actually touch the same
> atomic word? So we don't need extra things to order them?

Right.. in any case brain is confused. I'll look again at it later.
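
(For context, Boqun's point is that two accesses to the same atomic word are
already ordered against each other, so no separate barrier is needed for that
particular pair. A minimal userspace illustration with C11 atomics - invented
names, not the kernel code:)

        #include <stdatomic.h>
        #include <stdbool.h>

        #define FLAG_HANDOFF    0x02UL
        #define ALL_FLAGS       0x03UL

        /* low bits of one atomic word hold the flags, the rest the owner */
        static atomic_ulong owner_word;

        static void set_flag(unsigned long flag)
        {
                /* analogue of __mutex_set_flag(): an RMW on the word */
                atomic_fetch_or(&owner_word, flag);
        }

        static bool try_lock(unsigned long me)
        {
                /* analogue of __mutex_trylock(): read + cmpxchg on the same word */
                unsigned long old = atomic_load(&owner_word);

                if (old & ~ALL_FLAGS)           /* someone owns it */
                        return false;
                return atomic_compare_exchange_strong(&owner_word, &old,
                                                      me | (old & ALL_FLAGS));
        }

Because set_flag() and try_lock() operate on the same variable, the flag store
cannot be reordered after the trylock's read of that word on the same CPU.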

2016-10-17 23:17:11

by Waiman Long

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
> Doesn't really matter yet, but pull the HANDOFF and trylock out from
> under the wait_lock.
>
> The intention is to add an optimistic spin loop here, which requires
> we do not hold the wait_lock, so shuffle code around in preparation.
>
> Also clarify the purpose of taking the wait_lock in the wait loop; it's
> tempting to want to avoid it altogether, but the cancellation cases
> need it to avoid losing wakeups.
>
> Suggested-by: Waiman Long <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/locking/mutex.c | 30 +++++++++++++++++++++++++-----
> 1 file changed, 25 insertions(+), 5 deletions(-)

> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>
> lock_contended(&lock->dep_map, ip);
>
> + set_task_state(task, state);

Do we want to set the state here? I am not sure if it is OK to set the
task state without ever calling schedule().

> for (;;) {
> + /*
> + * Once we hold wait_lock, we're serialized against
> + * mutex_unlock() handing the lock off to us, do a trylock
> + * before testing the error conditions to make sure we pick up
> + * the handoff.
> + */
> if (__mutex_trylock(lock, first))
> - break;
> + goto acquired;
>
> /*
> - * got a signal? (This code gets eliminated in the
> - * TASK_UNINTERRUPTIBLE case.)
> + * Check for signals and wound conditions while holding
> + * wait_lock. This ensures the lock cancellation is ordered
> + * against mutex_unlock() and wake-ups do not go missing.
> */
> if (unlikely(signal_pending_state(state, task))) {
> ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> goto err;
> }
>
> - __set_task_state(task, state);
> spin_unlock_mutex(&lock->wait_lock, flags);
> schedule_preempt_disabled();
> - spin_lock_mutex(&lock->wait_lock, flags);
>
> if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> first = true;
> __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> }
> +
> + set_task_state(task, state);

I would suggest keeping the __set_task_state() above and changing
set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to
provide the memory barrier. Then we don't need to add __set_task_state()
calls below.

> + /*
> + * Here we order against unlock; we must either see it change
> + * state back to RUNNING and fall through the next schedule(),
> + * or we must see its unlock and acquire.
> + */
> + if (__mutex_trylock(lock, first))
> + break;
> +

I don't think we need a trylock here since we are going to do it at the
top of the loop within wait_lock anyway.

> + spin_lock_mutex(&lock->wait_lock, flags);
> }
> + spin_lock_mutex(&lock->wait_lock, flags);
> +acquired:
> __set_task_state(task, TASK_RUNNING);
>
> mutex_remove_waiter(lock, &waiter, task);
> @@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
> return 0;
>
> err:
> + __set_task_state(task, TASK_RUNNING);
> mutex_remove_waiter(lock, &waiter, task);
> spin_unlock_mutex(&lock->wait_lock, flags);
> debug_mutex_free_waiter(&waiter);
>
>

Cheers,
Longman

2016-10-18 13:14:17

by Peter Zijlstra

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Mon, Oct 17, 2016 at 07:16:50PM -0400, Waiman Long wrote:
> >+++ b/kernel/locking/mutex.c
> >@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> > lock_contended(&lock->dep_map, ip);
> >
> >+ set_task_state(task, state);
>
> Do we want to set the state here? I am not sure if it is OK to set the task
> state without ever calling schedule().

That's entirely fine, note how we'll set it back to RUNNING at the end.

> > for (;;) {
> >+ /*
> >+ * Once we hold wait_lock, we're serialized against
> >+ * mutex_unlock() handing the lock off to us, do a trylock
> >+ * before testing the error conditions to make sure we pick up
> >+ * the handoff.
> >+ */
> > if (__mutex_trylock(lock, first))
> >- break;
> >+ goto acquired;
> >
> > /*
> >- * got a signal? (This code gets eliminated in the
> >- * TASK_UNINTERRUPTIBLE case.)
> >+ * Check for signals and wound conditions while holding
> >+ * wait_lock. This ensures the lock cancellation is ordered
> >+ * against mutex_unlock() and wake-ups do not go missing.
> > */
> > if (unlikely(signal_pending_state(state, task))) {
> > ret = -EINTR;
> >@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > goto err;
> > }
> >
> >- __set_task_state(task, state);
> > spin_unlock_mutex(&lock->wait_lock, flags);
> > schedule_preempt_disabled();
> >- spin_lock_mutex(&lock->wait_lock, flags);
> >
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> >+
> >+ set_task_state(task, state);
>
> I would suggest keeping the __set_task_state() above and changing
> set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to provide
> the memory barrier. Then we don't need to add __set_task_state() calls
> below.

set_task_state(RUNNING) doesn't make sense, ever.

See the comment near set_task_state() for the reason it has a barrier.

We need it here because when we do that trylock (or optimistic spin) we
need to have set the state and done a barrier, otherwise we can miss a
wakeup and get stuck.
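
A stripped-down userspace analogue of why the order matters (C11 seq_cst
atomics; the names are invented and this is not the mutex code): the waiter
publishes its sleeping state before re-checking the lock word, and the
unlocker releases the lock word before checking the waiter's state, so at
least one side must observe the other.

        #include <stdatomic.h>
        #include <stdbool.h>

        static atomic_bool lock_is_free;        /* stands in for the owner word */
        static atomic_int  waiter_state;        /* 0 = RUNNING, 1 = sleeping    */

        /* waiter: set_task_state(); __mutex_trylock() or really sleep */
        static bool waiter_must_sleep(void)
        {
                atomic_store(&waiter_state, 1);                 /* store + barrier  */
                if (atomic_exchange(&lock_is_free, false)) {    /* trylock          */
                        atomic_store(&waiter_state, 0);         /* got it: RUNNING  */
                        return false;
                }
                return true;                                    /* go to schedule() */
        }

        /* unlocker: mutex_unlock() followed by the wakeup */
        static void unlock_and_wake(void)
        {
                atomic_store(&lock_is_free, true);              /* release the lock */
                if (atomic_load(&waiter_state))                 /* waiter asleep?   */
                        atomic_store(&waiter_state, 0);         /* then "wake" it   */
        }

With the state store ordered before the trylock, it is impossible for the
waiter to miss the unlock *and* for the unlocker to miss the sleeping state;
do the trylock before publishing the state and both can miss, which is the
stuck case above.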

> >+ /*
> >+ * Here we order against unlock; we must either see it change
> >+ * state back to RUNNING and fall through the next schedule(),
> >+ * or we must see its unlock and acquire.
> >+ */
> >+ if (__mutex_trylock(lock, first))
> >+ break;
> >+
>
> I don't think we need a trylock here since we are going to do it at the top
> of the loop within wait_lock anyway.

The idea was to avoid the wait-time of that lock acquire; also, this is
a place-holder for the optimistic spin site in the next patch.


2016-10-19 17:35:01

by Peter Zijlstra

Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> > +
> > + set_task_state(task, state);
>
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
>
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

OK, so after a discussion on IRC the problem appears to have been
unfamiliarity with the basic sleep/wakeup scheme. Mutex used to be the
odd duck out for being fully serialized by wait_lock.

The below adds a few words on how the 'normal' sleep/wakeup scheme
works.


---
Subject: sched: Better explain sleep/wakeup
From: Peter Zijlstra <[email protected]>
Date: Wed Oct 19 15:45:27 CEST 2016

There were a few questions wrt how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++----------------
kernel/sched/core.c | 15 +++++++-------
2 files changed, 44 insertions(+), 23 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
#define set_task_state(tsk, state_value) \
do { \
(tsk)->task_state_change = _THIS_IP_; \
- smp_store_mb((tsk)->state, (state_value)); \
+ smp_store_mb((tsk)->state, (state_value)); \
} while (0)

-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
#define __set_current_state(state_value) \
do { \
current->task_state_change = _THIS_IP_; \
@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
#define set_current_state(state_value) \
do { \
current->task_state_change = _THIS_IP_; \
- smp_store_mb(current->state, (state_value)); \
+ smp_store_mb(current->state, (state_value)); \
} while (0)

#else

+/*
+ * @tsk had better be current, or you get to keep the pieces.
+ *
+ * The only reason is that computing current can be more expensive than
+ * using a pointer that's already available.
+ *
+ * Therefore, see set_current_state().
+ */
#define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)
#define set_task_state(tsk, state_value) \
@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
* is correctly serialised wrt the caller's subsequent test of whether to
* actually sleep:
*
+ * for (;;) {
* set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
+ * if (!need_sleep)
+ * break;
+ *
+ * schedule();
+ * }
+ * __set_current_state(TASK_RUNNING);
+ *
+ * If the caller does not need such serialisation (because, for instance, the
+ * condition test and condition change and wakeup are under the same lock) then
+ * use __set_current_state().
+ *
+ * The above is typically ordered against the wakeup, which does:
+ *
+ * need_sleep = false;
+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *
+ * Where wake_up_state() (and all other wakeup primitives) imply enough
+ * barriers to order the store of the variable against wakeup.
+ *
+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
+ *
+ * This is obviously fine, since they both store the exact same value.
*
- * If the caller does not need such serialisation then use __set_current_state()
+ * Also see the comments of try_to_wake_up().
*/
#define __set_current_state(state_value) \
do { current->state = (state_value); } while (0)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
*
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
+ * If (@state & @p->state) @p->state = TASK_RUNNING.
*
- * Return: %true if @p was woken up, %false if it was already running.
- * or @state didn't match @p's state.
+ * If the task was not queued/runnable, also place it back on a runqueue.
+ *
+ * Atomic against schedule() which would dequeue a task, also see
+ * set_current_state().
+ *
+ * Return: %true if @p->state changes (an actual wakeup was done),
+ * %false otherwise.
*/
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

2016-10-24 01:57:46

by Davidlohr Bueso

Subject: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

On Wed, 19 Oct 2016, Peter Zijlstra wrote:

>Subject: sched: Better explain sleep/wakeup
>From: Peter Zijlstra <[email protected]>
>Date: Wed Oct 19 15:45:27 CEST 2016
>
>There were a few questions wrt how sleep-wakeup works. Try and explain
>it more.
>
>Requested-by: Will Deacon <[email protected]>
>Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>---
> include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++----------------
> kernel/sched/core.c | 15 +++++++-------
> 2 files changed, 44 insertions(+), 23 deletions(-)
>
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> #define set_task_state(tsk, state_value) \
> do { \
> (tsk)->task_state_change = _THIS_IP_; \
>- smp_store_mb((tsk)->state, (state_value)); \
>+ smp_store_mb((tsk)->state, (state_value)); \
> } while (0)
>
>-/*
>- * set_current_state() includes a barrier so that the write of current->state
>- * is correctly serialised wrt the caller's subsequent test of whether to
>- * actually sleep:
>- *
>- * set_current_state(TASK_UNINTERRUPTIBLE);
>- * if (do_i_need_to_sleep())
>- * schedule();
>- *
>- * If the caller does not need such serialisation then use __set_current_state()
>- */
> #define __set_current_state(state_value) \
> do { \
> current->task_state_change = _THIS_IP_; \
>@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> #define set_current_state(state_value) \
> do { \
> current->task_state_change = _THIS_IP_; \
>- smp_store_mb(current->state, (state_value)); \
>+ smp_store_mb(current->state, (state_value)); \
> } while (0)
>
> #else
>
>+/*
>+ * @tsk had better be current, or you get to keep the pieces.

That reminds me we were getting rid of the set_task_state() calls. Bcache was
pending, being the only user in the kernel that doesn't actually use current, but
instead breaks the newly created (yet blocked/uninterruptible) garbage collection
kthread. I cannot figure out why this is done (i.e. purposely accounting for the
load avg). Furthermore, gc obviously kicks in in very specific scenarios, such
as by the allocator task, so I don't see why bcache gc should want to be
interruptible.

Kent, Jens, can we get rid of this?

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534d1dd1..6e3c358b5759 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
if (IS_ERR(c->gc_thread))
return PTR_ERR(c->gc_thread);

- set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
return 0;
}

Thanks,
Davidlohr

>+ *
>+ * The only reason is that computing current can be more expensive than
>+ * using a pointer that's already available.
>+ *
>+ * Therefore, see set_current_state().
>+ */
> #define __set_task_state(tsk, state_value) \
> do { (tsk)->state = (state_value); } while (0)
> #define set_task_state(tsk, state_value) \
>@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
> * is correctly serialised wrt the caller's subsequent test of whether to
> * actually sleep:
> *
>+ * for (;;) {
> * set_current_state(TASK_UNINTERRUPTIBLE);
>- * if (do_i_need_to_sleep())
>- * schedule();
>+ * if (!need_sleep)
>+ * break;
>+ *
>+ * schedule();
>+ * }
>+ * __set_current_state(TASK_RUNNING);
>+ *
>+ * If the caller does not need such serialisation (because, for instance, the
>+ * condition test and condition change and wakeup are under the same lock) then
>+ * use __set_current_state().
>+ *
>+ * The above is typically ordered against the wakeup, which does:
>+ *
>+ * need_sleep = false;
>+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
>+ *
>+ * Where wake_up_state() (and all other wakeup primitives) imply enough
>+ * barriers to order the store of the variable against wakeup.
>+ *
>+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
>+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
>+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
>+ *
>+ * This is obviously fine, since they both store the exact same value.
> *
>- * If the caller does not need such serialisation then use __set_current_state()
>+ * Also see the comments of try_to_wake_up().
> */
> #define __set_current_state(state_value) \
> do { current->state = (state_value); } while (0)
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
> * @state: the mask of task states that can be woken
> * @wake_flags: wake modifier flags (WF_*)
> *
>- * Put it on the run-queue if it's not already there. The "current"
>- * thread is always on the run-queue (except when the actual
>- * re-schedule is in progress), and as such you're allowed to do
>- * the simpler "current->state = TASK_RUNNING" to mark yourself
>- * runnable without the overhead of this.
>+ * If (@state & @p->state) @p->state = TASK_RUNNING.
> *
>- * Return: %true if @p was woken up, %false if it was already running.
>- * or @state didn't match @p's state.
>+ * If the task was not queued/runnable, also place it back on a runqueue.
>+ *
>+ * Atomic against schedule() which would dequeue a task, also see
>+ * set_current_state().
>+ *
>+ * Return: %true if @p->state changes (an actual wakeup was done),
>+ * %false otherwise.
> */
> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

2016-10-24 13:26:49

by Kent Overstreet

Subject: Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
>
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Oct 19 15:45:27 CEST 2016
> >
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> >
> > Requested-by: Will Deacon <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++----------------
> > kernel/sched/core.c | 15 +++++++-------
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value) \
> > do { \
> > (tsk)->task_state_change = _THIS_IP_; \
> > - smp_store_mb((tsk)->state, (state_value)); \
> > + smp_store_mb((tsk)->state, (state_value)); \
> > } while (0)
> >
> > -/*
> > - * set_current_state() includes a barrier so that the write of current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - * set_current_state(TASK_UNINTERRUPTIBLE);
> > - * if (do_i_need_to_sleep())
> > - * schedule();
> > - *
> > - * If the caller does not need such serialisation then use __set_current_state()
> > - */
> > #define __set_current_state(state_value) \
> > do { \
> > current->task_state_change = _THIS_IP_; \
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value) \
> > do { \
> > current->task_state_change = _THIS_IP_; \
> > - smp_store_mb(current->state, (state_value)); \
> > + smp_store_mb(current->state, (state_value)); \
> > } while (0)
> >
> > #else
> >
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
>
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being the only user in the kernel that doesn't actually use current, but
> instead breaks the newly created (yet blocked/uninterruptible) garbage collection
> kthread. I cannot figure out why this is done (i.e. purposely accounting for the
> load avg). Furthermore, gc obviously kicks in in very specific scenarios, such
> as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
>
> Kent, Jens, can we get rid of this?
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534d1dd1..6e3c358b5759 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
> if (IS_ERR(c->gc_thread))
> return PTR_ERR(c->gc_thread);
>
> - set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
> return 0;
> }

Actually, that code looks broken, or at least stupid. Let me do a proper fix...

2016-10-24 14:27:21

by Kent Overstreet

Subject: Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
>
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Oct 19 15:45:27 CEST 2016
> >
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> >
> > Requested-by: Will Deacon <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++----------------
> > kernel/sched/core.c | 15 +++++++-------
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value) \
> > do { \
> > (tsk)->task_state_change = _THIS_IP_; \
> > - smp_store_mb((tsk)->state, (state_value)); \
> > + smp_store_mb((tsk)->state, (state_value)); \
> > } while (0)
> >
> > -/*
> > - * set_current_state() includes a barrier so that the write of current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - * set_current_state(TASK_UNINTERRUPTIBLE);
> > - * if (do_i_need_to_sleep())
> > - * schedule();
> > - *
> > - * If the caller does not need such serialisation then use __set_current_state()
> > - */
> > #define __set_current_state(state_value) \
> > do { \
> > current->task_state_change = _THIS_IP_; \
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value) \
> > do { \
> > current->task_state_change = _THIS_IP_; \
> > - smp_store_mb(current->state, (state_value)); \
> > + smp_store_mb(current->state, (state_value)); \
> > } while (0)
> >
> > #else
> >
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
>
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being the only user in the kernel that doesn't actually use current, but
> instead breaks the newly created (yet blocked/uninterruptible) garbage collection
> kthread. I cannot figure out why this is done (i.e. purposely accounting for the
> load avg). Furthermore, gc obviously kicks in in very specific scenarios, such
> as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
>
> Kent, Jens, can we get rid of this?

Here's a patch that just fixes the way gc gets woken up. Eric, you want to take
this?

-- >8 --
Subject: [PATCH] bcache: Make gc wakeup saner

This lets us ditch a set_task_state() call.

Signed-off-by: Kent Overstreet <[email protected]>
---
drivers/md/bcache/bcache.h | 4 ++--
drivers/md/bcache/btree.c | 39 ++++++++++++++++++++-------------------
drivers/md/bcache/btree.h | 3 +--
drivers/md/bcache/request.c | 4 +---
drivers/md/bcache/super.c | 2 ++
5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a55c7..c3ea03c9a1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -425,7 +425,7 @@ struct cache {
* until a gc finishes - otherwise we could pointlessly burn a ton of
* cpu
*/
- unsigned invalidate_needs_gc:1;
+ unsigned invalidate_needs_gc;

bool discard; /* Get rid of? */

@@ -593,8 +593,8 @@ struct cache_set {

/* Counts how many sectors bio_insert has added to the cache */
atomic_t sectors_to_gc;
+ wait_queue_head_t gc_wait;

- wait_queue_head_t moving_gc_wait;
struct keybuf moving_gc_keys;
/* Number of moving GC bios in flight */
struct semaphore moving_in_flight;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81d3db40cd..2efdce0724 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
bch_moving_gc(c);
}

-static int bch_gc_thread(void *arg)
+static bool gc_should_run(struct cache_set *c)
{
- struct cache_set *c = arg;
struct cache *ca;
unsigned i;

- while (1) {
-again:
- bch_btree_gc(c);
+ for_each_cache(ca, c, i)
+ if (ca->invalidate_needs_gc)
+ return true;

- set_current_state(TASK_INTERRUPTIBLE);
- if (kthread_should_stop())
- break;
+ if (atomic_read(&c->sectors_to_gc) < 0)
+ return true;

- mutex_lock(&c->bucket_lock);
+ return false;
+}

- for_each_cache(ca, c, i)
- if (ca->invalidate_needs_gc) {
- mutex_unlock(&c->bucket_lock);
- set_current_state(TASK_RUNNING);
- goto again;
- }
+static int bch_gc_thread(void *arg)
+{
+ struct cache_set *c = arg;

- mutex_unlock(&c->bucket_lock);
+ while (1) {
+ wait_event_interruptible(c->gc_wait,
+ kthread_should_stop() || gc_should_run(c));

- schedule();
+ if (kthread_should_stop())
+ break;
+
+ set_gc_sectors(c);
+ bch_btree_gc(c);
}

return 0;
@@ -1790,11 +1792,10 @@ static int bch_gc_thread(void *arg)

int bch_gc_thread_start(struct cache_set *c)
{
- c->gc_thread = kthread_create(bch_gc_thread, c, "bcache_gc");
+ c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
if (IS_ERR(c->gc_thread))
return PTR_ERR(c->gc_thread);

- set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
return 0;
}

diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 5c391fa01b..9b80417cd5 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -260,8 +260,7 @@ void bch_initial_mark_key(struct cache_set *, int, struct bkey *);

static inline void wake_up_gc(struct cache_set *c)
{
- if (c->gc_thread)
- wake_up_process(c->gc_thread);
+ wake_up(&c->gc_wait);
}

#define MAP_DONE 0
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 40ffe5e424..a37c1776f2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -196,10 +196,8 @@ static void bch_data_insert_start(struct closure *cl)
struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
struct bio *bio = op->bio, *n;

- if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
- set_gc_sectors(op->c);
+ if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
wake_up_gc(op->c);
- }

if (op->bypass)
return bch_data_invalidate(cl);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 849ad441cd..66669c8f41 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1491,6 +1491,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
mutex_init(&c->bucket_lock);
init_waitqueue_head(&c->btree_cache_wait);
init_waitqueue_head(&c->bucket_wait);
+ init_waitqueue_head(&c->gc_wait);
sema_init(&c->uuid_write_mutex, 1);

spin_lock_init(&c->btree_gc_time.lock);
@@ -1550,6 +1551,7 @@ static void run_cache_set(struct cache_set *c)

for_each_cache(ca, c, i)
c->nbuckets += ca->sb.nbuckets;
+ set_gc_sectors(c);

if (CACHE_SYNC(&c->sb)) {
LIST_HEAD(journal);
--
2.9.3

2016-10-25 16:55:27

by Eric Wheeler

Subject: Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

On Mon, 24 Oct 2016, Kent Overstreet wrote:
> On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> > On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> >
> > > Subject: sched: Better explain sleep/wakeup
> > > From: Peter Zijlstra <[email protected]>
> > > Date: Wed Oct 19 15:45:27 CEST 2016
> > >
> > > There were a few questions wrt how sleep-wakeup works. Try and explain
> > > it more.
> > >
> > > Requested-by: Will Deacon <[email protected]>
> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > > ---
> > > include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++----------------
> > > kernel/sched/core.c | 15 +++++++-------
> > > 2 files changed, 44 insertions(+), 23 deletions(-)
> > >
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_task_state(tsk, state_value) \
> > > do { \
> > > (tsk)->task_state_change = _THIS_IP_; \
> > > - smp_store_mb((tsk)->state, (state_value)); \
> > > + smp_store_mb((tsk)->state, (state_value)); \
> > > } while (0)
> > >
> > > -/*
> > > - * set_current_state() includes a barrier so that the write of current->state
> > > - * is correctly serialised wrt the caller's subsequent test of whether to
> > > - * actually sleep:
> > > - *
> > > - * set_current_state(TASK_UNINTERRUPTIBLE);
> > > - * if (do_i_need_to_sleep())
> > > - * schedule();
> > > - *
> > > - * If the caller does not need such serialisation then use __set_current_state()
> > > - */
> > > #define __set_current_state(state_value) \
> > > do { \
> > > current->task_state_change = _THIS_IP_; \
> > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_current_state(state_value) \
> > > do { \
> > > current->task_state_change = _THIS_IP_; \
> > > - smp_store_mb(current->state, (state_value)); \
> > > + smp_store_mb(current->state, (state_value)); \
> > > } while (0)
> > >
> > > #else
> > >
> > > +/*
> > > + * @tsk had better be current, or you get to keep the pieces.
> >
> > That reminds me we were getting rid of the set_task_state() calls. Bcache was
> > pending, being the only user in the kernel that doesn't actually use current, but
> > instead breaks the newly created (yet blocked/uninterruptible) garbage collection
> > kthread. I cannot figure out why this is done (i.e. purposely accounting for the
> > load avg). Furthermore, gc obviously kicks in in very specific scenarios, such
> > as by the allocator task, so I don't see why bcache gc should want to be
> > interruptible.
> >
> > Kent, Jens, can we get rid of this?
>
> Here's a patch that just fixes the way gc gets woken up. Eric, you want to take
> this?

Sure, I'll put it up with my -rc2 pull request to Jens.

A couple of sanity checks (for my understanding at least):

Why does bch_data_insert_start() no longer need to call
set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?

Does bch_cache_set_alloc() even need to call set_gc_sectors since
bch_gc_thread() does before calling bch_btree_gc?

Also I'm curious, why change invalidate_needs_gc from a bitfield?

-Eric

>
> -- >8 --
> Subject: [PATCH] bcache: Make gc wakeup saner
>
> This lets us ditch a set_task_state() call.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> drivers/md/bcache/bcache.h | 4 ++--
> drivers/md/bcache/btree.c | 39 ++++++++++++++++++++-------------------
> drivers/md/bcache/btree.h | 3 +--
> drivers/md/bcache/request.c | 4 +---
> drivers/md/bcache/super.c | 2 ++
> 5 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a55c7..c3ea03c9a1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -425,7 +425,7 @@ struct cache {
> * until a gc finishes - otherwise we could pointlessly burn a ton of
> * cpu
> */
> - unsigned invalidate_needs_gc:1;
> + unsigned invalidate_needs_gc;
>
> bool discard; /* Get rid of? */
>
> @@ -593,8 +593,8 @@ struct cache_set {
>
> /* Counts how many sectors bio_insert has added to the cache */
> atomic_t sectors_to_gc;
> + wait_queue_head_t gc_wait;
>
> - wait_queue_head_t moving_gc_wait;
> struct keybuf moving_gc_keys;
> /* Number of moving GC bios in flight */
> struct semaphore moving_in_flight;
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81d3db40cd..2efdce0724 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
> bch_moving_gc(c);
> }
>
> -static int bch_gc_thread(void *arg)
> +static bool gc_should_run(struct cache_set *c)
> {
> - struct cache_set *c = arg;
> struct cache *ca;
> unsigned i;
>
> - while (1) {
> -again:
> - bch_btree_gc(c);
> + for_each_cache(ca, c, i)
> + if (ca->invalidate_needs_gc)
> + return true;
>
> - set_current_state(TASK_INTERRUPTIBLE);
> - if (kthread_should_stop())
> - break;
> + if (atomic_read(&c->sectors_to_gc) < 0)
> + return true;
>
> - mutex_lock(&c->bucket_lock);
> + return false;
> +}
>
> - for_each_cache(ca, c, i)
> - if (ca->invalidate_needs_gc) {
> - mutex_unlock(&c->bucket_lock);
> - set_current_state(TASK_RUNNING);
> - goto again;
> - }
> +static int bch_gc_thread(void *arg)
> +{
> + struct cache_set *c = arg;
>
> - mutex_unlock(&c->bucket_lock);
> + while (1) {
> + wait_event_interruptible(c->gc_wait,
> + kthread_should_stop() || gc_should_run(c));
>
> - schedule();
> + if (kthread_should_stop())
> + break;
> +
> + set_gc_sectors(c);
> + bch_btree_gc(c);
> }
>
> return 0;
> @@ -1790,11 +1792,10 @@ static int bch_gc_thread(void *arg)
>
> int bch_gc_thread_start(struct cache_set *c)
> {
> - c->gc_thread = kthread_create(bch_gc_thread, c, "bcache_gc");
> + c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
> if (IS_ERR(c->gc_thread))
> return PTR_ERR(c->gc_thread);
>
> - set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
> return 0;
> }
>
> diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
> index 5c391fa01b..9b80417cd5 100644
> --- a/drivers/md/bcache/btree.h
> +++ b/drivers/md/bcache/btree.h
> @@ -260,8 +260,7 @@ void bch_initial_mark_key(struct cache_set *, int, struct bkey *);
>
> static inline void wake_up_gc(struct cache_set *c)
> {
> - if (c->gc_thread)
> - wake_up_process(c->gc_thread);
> + wake_up(&c->gc_wait);
> }
>
> #define MAP_DONE 0
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 40ffe5e424..a37c1776f2 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -196,10 +196,8 @@ static void bch_data_insert_start(struct closure *cl)
> struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
> struct bio *bio = op->bio, *n;
>
> - if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
> - set_gc_sectors(op->c);
> + if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
> wake_up_gc(op->c);
> - }
>
> if (op->bypass)
> return bch_data_invalidate(cl);
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 849ad441cd..66669c8f41 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1491,6 +1491,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
> mutex_init(&c->bucket_lock);
> init_waitqueue_head(&c->btree_cache_wait);
> init_waitqueue_head(&c->bucket_wait);
> + init_waitqueue_head(&c->gc_wait);
> sema_init(&c->uuid_write_mutex, 1);
>
> spin_lock_init(&c->btree_gc_time.lock);
> @@ -1550,6 +1551,7 @@ static void run_cache_set(struct cache_set *c)
>
> for_each_cache(ca, c, i)
> c->nbuckets += ca->sb.nbuckets;
> + set_gc_sectors(c);
>
> if (CACHE_SYNC(&c->sb)) {
> LIST_HEAD(journal);
> --
> 2.9.3
>

2016-10-25 17:46:04

by Kent Overstreet

Subject: Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

On Tue, Oct 25, 2016 at 09:55:21AM -0700, Eric Wheeler wrote:
> Sure, I'll put it up with my -rc2 pull request to Jens.
>
> A couple of sanity checks (for my understanding at least):
>
> Why does bch_data_insert_start() no longer need to call
> set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?

Before, the gc thread wasn't being explicitly signalled that it was time to run
- it was just being woken up, meaning that a spurious wakeup would cause gc to
run. At best it was sketchy and fragile, for multiple reasons - wait_event() is
a much better mechanism.

So with wait_event() gc is explicitly checking if it's time to run - the wakeup
doesn't really make anything happen by itself, it just pokes the gc thread so
it's able to notice that it should run.

When we signal a thread this way, we're in effect setting a global
variable that says "gc should run now", then kicking the gc thread so it can
check the "gc should run" variable. But wakeups aren't synchronous - our call to
wake_up() doesn't make the gc thread check that variable before it returns; all
we know when the wake_up() call returns is that the gc thread is going to check
that variable at some point in the future. So we can't set the "gc should run"
variable, wake up the gc thread, and then set it back to "gc shouldn't run yet" -
what'll happen most of the time is that the gc thread won't run before we set
the variable back to "gc shouldn't run yet"; it'll never see that it was
supposed to run and it'll go back to sleep.

So the way you make this work is to have the gc thread set the variable
back to "gc shouldn't run yet" _after_ it's seen it and decided to run.

> Does bch_cache_set_alloc() even need to call set_gc_sectors since
> bch_gc_thread() does before calling bch_btree_gc?

Yes, because the gc thread only resets the counter when it's decided to run - we
don't want it to run right away at startup.

> Also I'm curious, why change invalidate_needs_gc from a bitfield?

Bitfields are particularly unsafe for multiple threads to access - the compiler
has to emit instructions to do a read/modify/write of the whole word, which can
clobber adjacent data. A bare int is also not, in _general_, safe for multiple
threads to access without a lock, but for what it's doing here it's fine.
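
For illustration (hypothetical fields, not the actual bcache structs), the
problem with the bitfield is that neighbouring flags share one word:

        struct gc_flags {
                unsigned invalidate_needs_gc:1;
                unsigned other_flag:1;          /* shares the same word */
        };

        static struct gc_flags f;

        static void thread_a(void) { f.invalidate_needs_gc = 1; }
        static void thread_b(void) { f.other_flag = 1; }

Each assignment compiles to roughly load-word, set-bit, store-word. If
thread_a() and thread_b() run concurrently, one store can write back the old
value of the other bit and lose that update. With a plain "unsigned
invalidate_needs_gc;" the field has a word to itself, so a writer to a
neighbouring field can no longer clobber it - which, as said above, still
isn't a general substitute for atomics or a lock, but is enough for this use.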