2020-08-31 23:05:27

by Paul E. McKenney

Subject: Question on task_blocks_on_rt_mutex()

Hello!

The task_blocks_on_rt_mutex() function uses rt_mutex_owner() to
take a snapshot of the lock owner right up front. At this point,
the ->wait_lock is held, which at first glance prevents the owner
from leaving. Except that if there are not yet any waiters (that is,
the low-order bit of ->owner is zero), rt_mutex_fastunlock() might
locklessly clear the ->owner field. And in that case, it looks like
task_blocks_on_rt_mutex() will blithely continue using the ex-owner's
task_struct structure, without anything that I can see that prevents
the ex-owner from exiting.
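
For reference, the ->owner encoding in question looks roughly like the
following (a paraphrase of kernel/locking/rtmutex_common.h, so treat it as a
sketch rather than the exact code):

/*
 * Sketch of the rtmutex owner encoding: bit 0 of ->owner doubles as the
 * "has waiters" flag, and the lockless fastpath unlock is legal only
 * while that bit is clear.
 */
#define RT_MUTEX_HAS_WAITERS	1UL

static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
{
	unsigned long owner = (unsigned long)READ_ONCE(lock->owner);

	return (struct task_struct *)(owner & ~RT_MUTEX_HAS_WAITERS);
}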

What am I missing here?

The reason that I am looking into this is that locktorture scenario LOCK05
hangs, and does so leaving the torture_rtmutex.waiters field equal to 0x1.
This is of course a legal transitional state, but I would not expect it
to persist for more than three minutes. Yet it often does.

This leads me to believe that there is a way for an unlock to fail to wake
up a task concurrently acquiring the lock. This seems to be repaired
by later lock acquisitions, and in fact setting the locktorture.stutter
module parameter to zero avoids the hang. Except that I first found the
above apparently unprotected access to what was recently the owner task.

Thoughts?

Thanx, Paul


2020-08-31 23:23:01

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Mon, Aug 31, 2020 at 03:49:11PM -0700, Paul E. McKenney wrote:
> Hello!
>
> The task_blocks_on_rt_mutex() function uses rt_mutex_owner() to
> take a snapshot of the lock owner right up front. At this point,
> the ->wait_lock is held, which at first glance prevents the owner
> from leaving. Except that if there are not yet any waiters (that is,
> the low-order bit of ->owner is zero), rt_mutex_fastunlock() might
> locklessly clear the ->owner field. And in that case, it looks like
> task_blocks_on_rt_mutex() will blithely continue using the ex-owner's
> task_struct structure, without anything that I can see that prevents
> the ex-owner from exiting.
>
> What am I missing here?

One thing I missed is that the low-order bit of ->owner would already
be set by this point.

> The reason that I am looking into this is that locktorture scenario LOCK05
> hangs, and does so leaving the torture_rtmutex.waiters field equal to 0x1.
> This is of course a legal transitional state, but I would not expect it
> to persist for more than three minutes. Yet it often does.
>
> This leads me to believe that there is a way for an unlock to fail to wake
> up a task concurrently acquiring the lock. This seems to be repaired
> by later lock acquisitions, and in fact setting the locktorture.stutter
> module parameter to zero avoids the hang. Except that I first found the
> above apparently unprotected access to what was recently the owner task.
>
> Thoughts?

Some breakage elsewhere, presumably...

Thanx, Paul

2020-09-01 17:51:01

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Mon, Aug 31, 2020 at 04:21:30PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 31, 2020 at 03:49:11PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > The task_blocks_on_rt_mutex() function uses rt_mutex_owner() to
> > take a snapshot of the lock owner right up front. At this point,
> > the ->wait_lock is held, which at first glance prevents the owner
> > from leaving. Except that if there are not yet any waiters (that is,
> > the low-order bit of ->owner is zero), rt_mutex_fastunlock() might
> > locklessly clear the ->owner field. And in that case, it looks like
> > task_blocks_on_rt_mutex() will blithely continue using the ex-owner's
> > task_struct structure, without anything that I can see that prevents
> > the ex-owner from exiting.
> >
> > What am I missing here?
>
> One thing I missed is that the low-order bit of ->owner would already
> be set by this point.
>
> > The reason that I am looking into this is that locktorture scenario LOCK05
> > hangs, and does so leaving the torture_rtmutex.waiters field equal to 0x1.
> > This is of course a legal transitional state, but I would not expect it
> > to persist for more than three minutes. Yet it often does.
> >
> > This leads me to believe that there is a way for an unlock to fail to wake
> > up a task concurrently acquiring the lock. This seems to be repaired
> > by later lock acquisitions, and in fact setting the locktorture.stutter
> > module parameter to zero avoids the hang. Except that I first found the
> > above apparently unprotected access to what was recently the owner task.
> >
> > Thoughts?
>
> Some breakage elsewhere, presumably...

And it might be a lost wakeup, given that ->locktorture_wake_up is equal
to 1 for one of the locktorture writer tasks with the patch below applied.
Yes, this is a virtual environment. Except that none of the other
locktorture scenarios make this happen, nor the rcutorture scenario,
nor the scftorture scenarios. Maybe just the wrong timing? Who knows,
looking further.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd93..054b6f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -720,6 +720,7 @@ struct task_struct {
struct list_head rcu_node_entry;
struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
+ u8 locktorture_wake_up;

#ifdef CONFIG_TASKS_RCU
unsigned long rcu_tasks_nvcsw;
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 316531d..7f183be 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -617,6 +617,7 @@ static int lock_torture_writer(void *arg)

VERBOSE_TOROUT_STRING("lock_torture_writer task started");
set_user_nice(current, MAX_NICE);
+ current->locktorture_wake_up = 0;

do {
if ((torture_random(&rand) & 0xfffff) == 0)
@@ -635,6 +636,7 @@ static int lock_torture_writer(void *arg)
lock_is_write_held = false;
WRITE_ONCE(last_lock_release, jiffies);
cxt.cur_ops->writeunlock();
+ current->locktorture_wake_up = 0;

stutter_wait("lock_torture_writer");
} while (!torture_must_stop());
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index cfdd5b9..4a0b046 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1053,6 +1053,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
*/
preempt_disable();
wake_q_add(wake_q, waiter->task);
+ waiter->task->locktorture_wake_up = 1;
raw_spin_unlock(&current->pi_lock);
}

2020-09-01 23:59:58

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Tue, Sep 01, 2020 at 10:49:38AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 31, 2020 at 04:21:30PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 31, 2020 at 03:49:11PM -0700, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > The task_blocks_on_rt_mutex() function uses rt_mutex_owner() to
> > > take a snapshot of the lock owner right up front. At this point,
> > > the ->wait_lock is held, which at first glance prevents the owner
> > > from leaving. Except that if there are not yet any waiters (that is,
> > > the low-order bit of ->owner is zero), rt_mutex_fastunlock() might
> > > locklessly clear the ->owner field. And in that case, it looks like
> > > task_blocks_on_rt_mutex() will blithely continue using the ex-owner's
> > > task_struct structure, without anything that I can see that prevents
> > > the ex-owner from exiting.
> > >
> > > What am I missing here?
> >
> > One thing I missed is that the low-order bit of ->owner would already
> > be set by this point.
> >
> > > The reason that I am looking into this is that locktorture scenario LOCK05
> > > hangs, and does so leaving the torture_rtmutex.waiters field equal to 0x1.
> > > This is of course a legal transitional state, but I would not expect it
> > > to persist for more than three minutes. Yet it often does.
> > >
> > > This leads me to believe that there is a way for an unlock to fail to wake
> > > up a task concurrently acquiring the lock. This seems to be repaired
> > > by later lock acquisitions, and in fact setting the locktorture.stutter
> > > module parameter to zero avoids the hang. Except that I first found the
> > > above apparently unprotected access to what was recently the owner task.
> > >
> > > Thoughts?
> >
> > Some breakage elsewhere, presumably...
>
> And it might be a lost wakeup, given that ->locktorture_wake_up is equal
> to 1 for one of the locktorture writer tasks with the patch below applied.
> Yes, this is a virtual environment. Except that none of the other
> locktorture scenarios make this happen, nor the rcutorture scenario,
> nor the scftorture scenarios. Maybe just the wrong timing? Who knows,
> looking further.

And it appears that a default-niced CPU-bound SCHED_OTHER process is
not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
close enough to a hang for most people.

Which means that the patch below prevents the hangs. And maybe does
other things as well, firing rcutorture up on it to check.

But is this indefinite delay expected behavior?

This reproduces for me on current mainline as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05

This hangs within a minute of boot on my setup. Here "hangs" is defined
as stopping the per-15-second console output of:
Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/torture.c b/kernel/torture.c
index 1061492..9310c1d 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -602,6 +602,7 @@ static int stutter_gap;
*/
bool stutter_wait(const char *title)
{
+ unsigned i = 0;
int spt;
bool ret = false;

@@ -612,8 +613,13 @@ bool stutter_wait(const char *title)
if (spt == 1) {
schedule_timeout_interruptible(1);
} else if (spt == 2) {
- while (READ_ONCE(stutter_pause_test))
+ while (READ_ONCE(stutter_pause_test)) {
+ if (!(++i & 0xffff))
+ schedule_timeout_interruptible(1);
+ else if (!(i & 0xfff))
+ udelay(1);
cond_resched();
+ }
} else {
schedule_timeout_interruptible(round_jiffies_relative(HZ));
}

2020-09-02 02:09:22

by Davidlohr Bueso

Subject: Re: Question on task_blocks_on_rt_mutex()

On Tue, 01 Sep 2020, Paul E. McKenney wrote:

>And it appears that a default-niced CPU-bound SCHED_OTHER process is
>not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
>OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
>close enough to a hang for most people.
>
>Which means that the patch below prevents the hangs. And maybe does
>other things as well, firing rcutorture up on it to check.
>
>But is this indefinite delay expected behavior?
>
>This reproduces for me on current mainline as follows:
>
>tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
>
>This hangs within a minute of boot on my setup. Here "hangs" is defined
>as stopping the per-15-second console output of:
> Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0

Ok, this doesn't seem to be related to lockless wake_qs then. FYI, there have
been missed wakeups in the past where wake_q_add() fails the cmpxchg because
the task already has a wakeup pending, leading to the actual wakeup occurring
before its corresponding wake_up_q(). This is why we have wake_q_add_safe().
But for rtmutexes, because there is no lock stealing, only the top waiter is
awoken, and because try_to_take_rt_mutex() is done under the lock->wait_lock,
I was not seeing an actual race here.
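
For anyone reading along, the wake_q_add()/wake_q_add_safe() distinction is
roughly the following (paraphrased from kernel/sched/core.c; the real code
carries additional memory-barrier and commentary detail):

static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
{
	struct wake_q_node *node = &task->wake_q;

	/*
	 * If the cmpxchg() fails, the task is already queued for wakeup by
	 * someone else, and that wakeup may be issued before our own
	 * wake_up_q() gets around to it.
	 */
	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
		return false;

	/* The head is context-local, so no concurrency on the list itself. */
	*head->lastp = node;
	head->lastp = &node->next;
	return true;
}

void wake_q_add(struct wake_q_head *head, struct task_struct *task)
{
	if (__wake_q_add(head, task))
		get_task_struct(task);	/* hold a reference until wake_up_q() */
}

void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
{
	/* Caller already holds a reference; drop it if the task was queued. */
	if (!__wake_q_add(head, task))
		put_task_struct(task);
}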

Thanks,
Davidlohr

2020-09-02 15:55:13

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Tue, Sep 01, 2020 at 06:51:28PM -0700, Davidlohr Bueso wrote:
> On Tue, 01 Sep 2020, Paul E. McKenney wrote:
>
> > And it appears that a default-niced CPU-bound SCHED_OTHER process is
> > not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
> > OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
> > close enough to a hang for most people.
> >
> > Which means that the patch below prevents the hangs. And maybe does
> > other things as well, firing rcutorture up on it to check.
> >
> > But is this indefinite delay expected behavior?
> >
> > This reproduces for me on current mainline as follows:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> >
> > This hangs within a minute of boot on my setup. Here "hangs" is defined
> > as stopping the per-15-second console output of:
> > Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0
>
> Ok, this doesn't seem to be related to lockless wake_qs then. FYI, there have
> been missed wakeups in the past where wake_q_add() fails the cmpxchg because
> the task already has a wakeup pending, leading to the actual wakeup occurring
> before its corresponding wake_up_q(). This is why we have wake_q_add_safe().
> But for rtmutexes, because there is no lock stealing, only the top waiter is
> awoken, and because try_to_take_rt_mutex() is done under the lock->wait_lock,
> I was not seeing an actual race here.

This problem is avoided if stutter_wait() does the occasional sleep.
I would have expected preemption to take effect, but even setting the
kthreads in stutter_wait() to MAX_NICE doesn't help. The current fix
destroys the intended instant-on nature of stutter_wait(), so the eventual
fix will need to use hrtimer-based sleeps or some such.

Thanx, Paul

2020-09-03 20:09:47

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Wed, Sep 02, 2020 at 08:54:10AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 01, 2020 at 06:51:28PM -0700, Davidlohr Bueso wrote:
> > On Tue, 01 Sep 2020, Paul E. McKenney wrote:
> >
> > > And it appears that a default-niced CPU-bound SCHED_OTHER process is
> > > not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
> > > OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
> > > close enough to a hang for most people.
> > >
> > > Which means that the patch below prevents the hangs. And maybe does
> > > other things as well, firing rcutorture up on it to check.
> > >
> > > But is this indefinite delay expected behavior?
> > >
> > > This reproduces for me on current mainline as follows:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> > >
> > > This hangs within a minute of boot on my setup. Here "hangs" is defined
> > > as stopping the per-15-second console output of:
> > > Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0
> >
> > Ok, this doesn't seem to be related to lockless wake_qs then. FYI, there have
> > been missed wakeups in the past where wake_q_add() fails the cmpxchg because
> > the task already has a wakeup pending, leading to the actual wakeup occurring
> > before its corresponding wake_up_q(). This is why we have wake_q_add_safe().
> > But for rtmutexes, because there is no lock stealing, only the top waiter is
> > awoken, and because try_to_take_rt_mutex() is done under the lock->wait_lock,
> > I was not seeing an actual race here.
>
> This problem is avoided if stutter_wait() does the occasional sleep.
> I would have expected preemption to take effect, but even setting the
> kthreads in stutter_wait() to MAX_NICE doesn't help. The current fix
> destroys the intended instant-on nature of stutter_wait(), so the eventual
> fix will need to use hrtimer-based sleeps or some such.

And here is my current best shot at a workaround/fix/whatever. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit d93a64389f4d544ded241d0ba30b2586497f5dc0
Author: Paul E. McKenney <[email protected]>
Date: Tue Sep 1 16:58:41 2020 -0700

torture: Periodically pause in stutter_wait()

Running locktorture scenario LOCK05 results in hangs:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05

The lock_torture_writer() kthreads set themselves to MAX_NICE while
running SCHED_OTHER. Other locktorture kthreads run at default niceness,
also SCHED_OTHER. This results in these other locktorture kthreads
indefinitely preempting the lock_torture_writer() kthreads. Note that
the cond_resched() in the stutter_wait() function's loop is ineffective
because this scenario is built with CONFIG_PREEMPT=y.

It is not clear that such indefinite preemption is supposed to happen, but
in the meantime this commit prevents kthreads running in stutter_wait()
from being completely CPU-bound, thus allowing the other threads to get
some CPU in a timely fashion. This commit also uses hrtimers to provide
very short sleeps to avoid degrading the sudden-on testing that stutter
is supposed to provide.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/torture.c b/kernel/torture.c
index 1061492..5488ad2 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -602,8 +602,11 @@ static int stutter_gap;
*/
bool stutter_wait(const char *title)
{
- int spt;
+ ktime_t delay;
+ unsigned i = 0;
+ int oldnice;
bool ret = false;
+ int spt;

cond_resched_tasks_rcu_qs();
spt = READ_ONCE(stutter_pause_test);
@@ -612,8 +615,17 @@ bool stutter_wait(const char *title)
if (spt == 1) {
schedule_timeout_interruptible(1);
} else if (spt == 2) {
- while (READ_ONCE(stutter_pause_test))
+ oldnice = task_nice(current);
+ set_user_nice(current, MAX_NICE);
+ while (READ_ONCE(stutter_pause_test)) {
+ if (!(i++ & 0xffff)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ delay = 10 * NSEC_PER_USEC;
+ schedule_hrtimeout(&delay, HRTIMER_MODE_REL);
+ }
cond_resched();
+ }
+ set_user_nice(current, oldnice);
} else {
schedule_timeout_interruptible(round_jiffies_relative(HZ));
}
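
As for why the pre-existing cond_resched() in that loop cannot help: with
CONFIG_PREEMPT=y (which selects CONFIG_PREEMPTION) it compiles down to
essentially nothing, roughly as below (paraphrased from include/linux/sched.h
of this era; exact details vary by kernel version):

/*
 * With CONFIG_PREEMPTION the explicit reschedule point is a no-op, on the
 * theory that the kernel will preempt on its own whenever need_resched is
 * set -- which apparently is not happening here on the wakeup of the
 * MAX_NICE writer.
 */
#ifndef CONFIG_PREEMPTION
extern int _cond_resched(void);
#else
static inline int _cond_resched(void) { return 0; }
#endif

#define cond_resched() ({			\
	___might_sleep(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})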

2020-09-04 17:43:11

by Davidlohr Bueso

Subject: Re: Question on task_blocks_on_rt_mutex()

On Thu, 03 Sep 2020, Paul E. McKenney wrote:

>commit d93a64389f4d544ded241d0ba30b2586497f5dc0
>Author: Paul E. McKenney <[email protected]>
>Date: Tue Sep 1 16:58:41 2020 -0700
>
> torture: Periodically pause in stutter_wait()
>
> Running locktorture scenario LOCK05 results in hangs:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
>
> The lock_torture_writer() kthreads set themselves to MAX_NICE while
> running SCHED_OTHER. Other locktorture kthreads run at default niceness,
> also SCHED_OTHER. This results in these other locktorture kthreads
> indefinitely preempting the lock_torture_writer() kthreads. Note that
> the cond_resched() in the stutter_wait() function's loop is ineffective
> because this scenario is built with CONFIG_PREEMPT=y.
>
> It is not clear that such indefinite preemption is supposed to happen, but
> in the meantime this commit prevents kthreads running in stutter_wait()
> from being completely CPU-bound, thus allowing the other threads to get
> some CPU in a timely fashion. This commit also uses hrtimers to provide
> very short sleeps to avoid degrading the sudden-on testing that stutter
> is supposed to provide.
>
> Signed-off-by: Paul E. McKenney <[email protected]>

Reviewed-by: Davidlohr Bueso <[email protected]>

2020-09-04 19:59:50

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Fri, Sep 04, 2020 at 10:24:32AM -0700, Davidlohr Bueso wrote:
> On Thu, 03 Sep 2020, Paul E. McKenney wrote:
>
> > commit d93a64389f4d544ded241d0ba30b2586497f5dc0
> > Author: Paul E. McKenney <[email protected]>
> > Date: Tue Sep 1 16:58:41 2020 -0700
> >
> > torture: Periodically pause in stutter_wait()
> >
> > Running locktorture scenario LOCK05 results in hangs:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> >
> > The lock_torture_writer() kthreads set themselves to MAX_NICE while
> > running SCHED_OTHER. Other locktorture kthreads run at default niceness,
> > also SCHED_OTHER. This results in these other locktorture kthreads
> > indefinitely preempting the lock_torture_writer() kthreads. Note that
> > the cond_resched() in the stutter_wait() function's loop is ineffective
> > because this scenario is built with CONFIG_PREEMPT=y.
> >
> > It is not clear that such indefinite preemption is supposed to happen, but
> > in the meantime this commit prevents kthreads running in stutter_wait()
> > from being completely CPU-bound, thus allowing the other threads to get
> > some CPU in a timely fashion. This commit also uses hrtimers to provide
> > very short sleeps to avoid degrading the sudden-on testing that stutter
> > is supposed to provide.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> Reviewed-by: Davidlohr Bueso <[email protected]>

Applied, thank you!

Thanx, Paul

2020-09-05 21:26:17

by Joel Fernandes

Subject: Re: Question on task_blocks_on_rt_mutex()

Hi Paul,

On Thu, Sep 03, 2020 at 01:06:39PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2020 at 08:54:10AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 01, 2020 at 06:51:28PM -0700, Davidlohr Bueso wrote:
> > > On Tue, 01 Sep 2020, Paul E. McKenney wrote:
> > >
> > > > And it appears that a default-niced CPU-bound SCHED_OTHER process is
> > > > not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
> > > > OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
> > > > close enough to a hang for most people.
> > > >
> > > > Which means that the patch below prevents the hangs. And maybe does
> > > > other things as well, firing rcutorture up on it to check.
> > > >
> > > > But is this indefinite delay expected behavior?
> > > >
> > > > This reproduces for me on current mainline as follows:
> > > >
> > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> > > >
> > > > This hangs within a minute of boot on my setup. Here "hangs" is defined
> > > > as stopping the per-15-second console output of:
> > > > Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0
> > >
> > > Ok, this doesn't seem to be related to lockless wake_qs then. FYI, there have
> > > been missed wakeups in the past where wake_q_add() fails the cmpxchg because
> > > the task already has a wakeup pending, leading to the actual wakeup occurring
> > > before its corresponding wake_up_q(). This is why we have wake_q_add_safe().
> > > But for rtmutexes, because there is no lock stealing, only the top waiter is
> > > awoken, and because try_to_take_rt_mutex() is done under the lock->wait_lock,
> > > I was not seeing an actual race here.
> >
> > This problem is avoided if stutter_wait() does the occasional sleep.
> > I would have expected preemption to take effect, but even setting the
> > kthreads in stutter_wait() to MAX_NICE doesn't help. The current fix
> > destroys the intended instant-on nature of stutter_wait(), so the eventual
> > fix will need to use hrtimer-based sleeps or some such.
>
> And here is my current best shot at a workaround/fix/whatever. Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit d93a64389f4d544ded241d0ba30b2586497f5dc0
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Sep 1 16:58:41 2020 -0700
>
> torture: Periodically pause in stutter_wait()
>
> Running locktorture scenario LOCK05 results in hangs:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
>
> The lock_torture_writer() kthreads set themselves to MAX_NICE while
> running SCHED_OTHER. Other locktorture kthreads run at default niceness,
> also SCHED_OTHER. This results in these other locktorture kthreads
> indefinitely preempting the lock_torture_writer() kthreads. Note that

In the past I have seen issues with niceness and CFS. Those issues were
related to tick granularity: if the scheduler tick is too coarse, then the
scheduler may allow a low-priority task to run for a bit longer. But this
also means that higher-priority tasks will take even longer to catch up to
the vruntime of the lower-priority ones. IIRC, this can run into several
seconds.

Not fully sure if that's what you're seeing. If you drop the niceness by some
amount, does the issue go away or get better?
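
For a rough sense of the scale involved, CFS advances vruntime by about
delta_exec * NICE_0_LOAD / weight, and the sched_prio_to_weight[] table (as I
recall it) gives nice 0 a weight of 1024 versus 15 for nice 19, so a
back-of-envelope check:

#include <stdio.h>

/*
 * Back-of-envelope only: ratio of vruntime accumulation for a MAX_NICE
 * (nice 19) task versus a nice-0 task, for the same amount of CPU time.
 * Weight values are the assumed sched_prio_to_weight[] entries.
 */
int main(void)
{
	const double nice0_weight  = 1024.0;	/* nice  0 */
	const double nice19_weight = 15.0;	/* nice 19 (MAX_NICE) */

	printf("vruntime cost ratio, nice 19 vs nice 0: ~%.0fx\n",
	       nice0_weight / nice19_weight);	/* prints ~68x */
	return 0;
}

In steady state that means the MAX_NICE task is entitled to only about
15/1024 of the CPU relative to a nice-0 hog on the same runqueue.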

> the cond_resched() in the stutter_wait() function's loop is ineffective
> because this scenario is built with CONFIG_PREEMPT=y.
>
> It is not clear that such indefinite preemption is supposed to happen, but
> in the meantime this commit prevents kthreads running in stutter_wait()
> from being completely CPU-bound, thus allowing the other threads to get
> some CPU in a timely fashion. This commit also uses hrtimers to provide
> very short sleeps to avoid degrading the sudden-on testing that stutter
> is supposed to provide.

There is a CFS tracepoint called sched:sched_stat_runtime. That could be
enabled to see what happens to the vruntime values on the wakeup of the lower
prio task.

I'm also seeing the LOCK05 failure: the hung-task detector shows that some
writer threads are in TASK_UNINTERRUPTIBLE state on LOCK05. So these
writers didn't wake up for over 2 minutes to begin with:

[ 246.797326] task:lock_torture_wr state:D stack:14696 pid: 72 ppid: 2 flags:0x00004000
[ 246.798826] Call Trace:
[ 246.799282] __schedule+0x414/0x6a0
[ 246.799917] schedule+0x41/0xe0
[ 246.800510] __rt_mutex_slowlock+0x49/0xd0
[ 246.801259] rt_mutex_slowlock+0xca/0x1e0
[ 246.801994] ? lock_torture_reader+0x110/0x110
[ 246.802799] torture_rtmutex_lock+0xc/0x10
[ 246.803545] lock_torture_writer+0x72/0x150
[ 246.804322] kthread+0x120/0x160
[ 246.804911] ? kthread_park+0x80/0x80
[ 246.805581] ret_from_fork+0x22/0x30
[ 246.806237] INFO: task lock_torture_wr:73 blocked for more than 122 seconds.
[ 246.807505] Not tainted 5.9.0-rc1+ #26
[ 246.808287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.809690] task:lock_torture_wr state:D stack:14696 pid: 73 ppid: 2 flags:0x00004000
[ 246.811208] Call Trace:
[ 246.811657] __schedule+0x414/0x6a0
[ 246.812306] schedule+0x41/0xe0
[ 246.812881] __rt_mutex_slowlock+0x49/0xd0
[ 246.813636] rt_mutex_slowlock+0xca/0x1e0
[ 246.814371] ? lock_torture_reader+0x110/0x110
[ 246.815182] torture_rtmutex_lock+0xc/0x10
[ 246.815923] lock_torture_writer+0x72/0x150
[ 246.816692] kthread+0x120/0x160
[ 246.817287] ? kthread_park+0x80/0x80
[ 246.817952] ret_from_fork+0x22/0x30

Could this just be a side effect of the issue you are seeing? (A writer
acquired a lock but never got the CPU to release it, which in turn caused
lock acquirers to block in D-state indefinitely).

thanks,

- Joel

2020-09-05 21:47:41

by Joel Fernandes

Subject: Re: Question on task_blocks_on_rt_mutex()

On Sat, Sep 05, 2020 at 05:24:06PM -0400, Joel Fernandes wrote:
> Hi Paul,
>
> On Thu, Sep 03, 2020 at 01:06:39PM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 02, 2020 at 08:54:10AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 01, 2020 at 06:51:28PM -0700, Davidlohr Bueso wrote:
> > > > On Tue, 01 Sep 2020, Paul E. McKenney wrote:
> > > >
> > > > > And it appears that a default-niced CPU-bound SCHED_OTHER process is
> > > > > not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
> > > > > OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
> > > > > close enough to a hang for most people.
> > > > >
> > > > > Which means that the patch below prevents the hangs. And maybe does
> > > > > other things as well, firing rcutorture up on it to check.
> > > > >
> > > > > But is this indefinite delay expected behavior?
> > > > >
> > > > > This reproduces for me on current mainline as follows:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> > > > >
> > > > > This hangs within a minute of boot on my setup. Here "hangs" is defined
> > > > > as stopping the per-15-second console output of:
> > > > > Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0
[...]
> > ------------------------------------------------------------------------
> >
> > commit d93a64389f4d544ded241d0ba30b2586497f5dc0
> > Author: Paul E. McKenney <[email protected]>
> > Date: Tue Sep 1 16:58:41 2020 -0700
> >
> > torture: Periodically pause in stutter_wait()
> >
> > Running locktorture scenario LOCK05 results in hangs:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> >
> > The lock_torture_writer() kthreads set themselves to MAX_NICE while
> > running SCHED_OTHER. Other locktorture kthreads run at default niceness,
> > also SCHED_OTHER. This results in these other locktorture kthreads
> > indefinitely preempting the lock_torture_writer() kthreads. Note that
>
> In the past I have seen issues with niceness and CFS. Those issues were
> related to tick granularity: if the scheduler tick is too coarse, then the
> scheduler may allow a low-priority task to run for a bit longer. But this
> also means that higher-priority tasks will take even longer to catch up to
> the vruntime of the lower-priority ones. IIRC, this can run into several
> seconds.
>
> Not fully sure if that's what you're seeing. If you drop the niceness by some
> amount, does the issue go away or get better?
>
> > the cond_resched() in the stutter_wait() function's loop is ineffective
> > because this scenario is built with CONFIG_PREEMPT=y.
> >
> > It is not clear that such indefinite preemption is supposed to happen, but
> > in the meantime this commit prevents kthreads running in stutter_wait()
> > from being completely CPU-bound, thus allowing the other threads to get
> > some CPU in a timely fashion. This commit also uses hrtimers to provide
> > very short sleeps to avoid degrading the sudden-on testing that stutter
> > is supposed to provide.
>
> There is a CFS tracepoint called sched:sched_stat_runtime. That could be
> enabled to see what happens to the vruntime values on the wakeup of the lower
> prio task.
>
> I'm also seeing the LOCK05 failure: the hung-task detector shows that some
> writer threads are in TASK_UNINTERRUPTIBLE state on LOCK05. So these
> writers didn't wake up for over 2 minutes to begin with:
>
> [ 246.797326] task:lock_torture_wr state:D stack:14696 pid: 72 ppid: 2 flags:0x00004000
> [ 246.798826] Call Trace:
> [ 246.799282] __schedule+0x414/0x6a0
> [ 246.799917] schedule+0x41/0xe0
> [ 246.800510] __rt_mutex_slowlock+0x49/0xd0
> [ 246.801259] rt_mutex_slowlock+0xca/0x1e0
> [ 246.801994] ? lock_torture_reader+0x110/0x110
> [ 246.802799] torture_rtmutex_lock+0xc/0x10
> [ 246.803545] lock_torture_writer+0x72/0x150
> [ 246.804322] kthread+0x120/0x160
> [ 246.804911] ? kthread_park+0x80/0x80
> [ 246.805581] ret_from_fork+0x22/0x30
> [ 246.806237] INFO: task lock_torture_wr:73 blocked for more than 122 seconds.
> [ 246.807505] Not tainted 5.9.0-rc1+ #26
> [ 246.808287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 246.809690] task:lock_torture_wr state:D stack:14696 pid: 73 ppid: 2 flags:0x00004000
> [ 246.811208] Call Trace:
> [ 246.811657] __schedule+0x414/0x6a0
> [ 246.812306] schedule+0x41/0xe0
> [ 246.812881] __rt_mutex_slowlock+0x49/0xd0
> [ 246.813636] rt_mutex_slowlock+0xca/0x1e0
> [ 246.814371] ? lock_torture_reader+0x110/0x110
> [ 246.815182] torture_rtmutex_lock+0xc/0x10
> [ 246.815923] lock_torture_writer+0x72/0x150
> [ 246.816692] kthread+0x120/0x160
> [ 246.817287] ? kthread_park+0x80/0x80
> [ 246.817952] ret_from_fork+0x22/0x30
>
> Could this just be a side effect of the issue you are seeing? (A writer
> acquired a lock but never got the CPU to release it, which in turn caused
> lock acquirers to block in D-state indefinitely).

It appears to me the reason could be that the higher-priority task is RT:

sched_switch: prev_comm=lock_torture_wr prev_pid=74 prev_prio=139 prev_state=R+ ==> next_comm=lock_torture_wr next_pid=70 next_prio=49

After this, only pid=70 runs until the hung-task detector screams.

Could this be because the writer calls cur_ops->task_boost(), which sets pid=70
to RT? As long as the RT task runs, it will block the CFS task without giving
it the CPU.
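
If that is what is happening, the mechanism would be roughly the sketch below
(not the exact kernel/locking/locktorture.c code, which as I recall also
randomizes when to boost and when to drop back; the priority value here is
illustrative only):

/* Sketch: switch the current kthread between SCHED_FIFO and SCHED_NORMAL. */
static void example_rt_boost(bool boost)
{
	struct sched_param param = {
		.sched_priority = boost ? 50 : 0,	/* illustrative value */
	};

	sched_setscheduler_nocheck(current, boost ? SCHED_FIFO : SCHED_NORMAL,
				   &param);
}

In sched_switch output an RT task's prio is reported as 99 minus its
sched_priority, so the next_prio=49 above corresponds to sched_priority 50,
and prev_prio=139 is DEFAULT_PRIO 120 plus nice 19, i.e., the MAX_NICE writer.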

thanks,

- Joel

2020-09-06 04:18:35

by Paul E. McKenney

Subject: Re: Question on task_blocks_on_rt_mutex()

On Sat, Sep 05, 2020 at 05:45:02PM -0400, Joel Fernandes wrote:
> On Sat, Sep 05, 2020 at 05:24:06PM -0400, Joel Fernandes wrote:
> > Hi Paul,
> >
> > On Thu, Sep 03, 2020 at 01:06:39PM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 02, 2020 at 08:54:10AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 01, 2020 at 06:51:28PM -0700, Davidlohr Bueso wrote:
> > > > > On Tue, 01 Sep 2020, Paul E. McKenney wrote:
> > > > >
> > > > > > And it appears that a default-niced CPU-bound SCHED_OTHER process is
> > > > > > not preempted by a newly awakened MAX_NICE SCHED_OTHER process. OK,
> > > > > > OK, I never waited for more than 10 minutes, but on my 2.2GHz that is
> > > > > > close enough to a hang for most people.
> > > > > >
> > > > > > Which means that the patch below prevents the hangs. And maybe does
> > > > > > other things as well, firing rcutorture up on it to check.
> > > > > >
> > > > > > But is this indefinite delay expected behavior?
> > > > > >
> > > > > > This reproduces for me on current mainline as follows:
> > > > > >
> > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> > > > > >
> > > > > > This hangs within a minute of boot on my setup. Here "hangs" is defined
> > > > > > as stopping the per-15-second console output of:
> > > > > > Writes: Total: 569906696 Max/Min: 81495031/63736508 Fail: 0
> [...]
> > > ------------------------------------------------------------------------
> > >
> > > commit d93a64389f4d544ded241d0ba30b2586497f5dc0
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Tue Sep 1 16:58:41 2020 -0700
> > >
> > > torture: Periodically pause in stutter_wait()
> > >
> > > Running locktorture scenario LOCK05 results in hangs:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
> > >
> > > The lock_torture_writer() kthreads set themselves to MAX_NICE while
> > > running SCHED_OTHER. Other locktorture kthreads run at default niceness,
> > > also SCHED_OTHER. This results in these other locktorture kthreads
> > > indefinitely preempting the lock_torture_writer() kthreads. Note that
> >
> > In the past I have seen issues with niceness and CFS. Those issues were
> > related to tick granularity: if the scheduler tick is too coarse, then the
> > scheduler may allow a low-priority task to run for a bit longer. But this
> > also means that higher-priority tasks will take even longer to catch up to
> > the vruntime of the lower-priority ones. IIRC, this can run into several
> > seconds.
> >
> > Not fully sure if that's what you're seeing. If you drop the niceness by some
> > amount, does the issue go away or get better?
> >
> > > the cond_resched() in the stutter_wait() function's loop is ineffective
> > > because this scenario is built with CONFIG_PREEMPT=y.
> > >
> > > It is not clear that such indefinite preemption is supposed to happen, but
> > > in the meantime this commit prevents kthreads running in stutter_wait()
> > > from being completely CPU-bound, thus allowing the other threads to get
> > > some CPU in a timely fashion. This commit also uses hrtimers to provide
> > > very short sleeps to avoid degrading the sudden-on testing that stutter
> > > is supposed to provide.
> >
> > There is a CFS tracepoint called sched:sched_stat_runtime. That could be
> > enabled to see what happens to the vruntime values on the wakeup of the lower
> > prio task.
> >
> > I'm also seeing the LOCK05 failure: the hung-task detector shows that some
> > writer threads are in TASK_UNINTERRUPTIBLE state on LOCK05. So these
> > writers didn't wake up for over 2 minutes to begin with:
> >
> > [ 246.797326] task:lock_torture_wr state:D stack:14696 pid: 72 ppid: 2 flags:0x00004000
> > [ 246.798826] Call Trace:
> > [ 246.799282] __schedule+0x414/0x6a0
> > [ 246.799917] schedule+0x41/0xe0
> > [ 246.800510] __rt_mutex_slowlock+0x49/0xd0
> > [ 246.801259] rt_mutex_slowlock+0xca/0x1e0
> > [ 246.801994] ? lock_torture_reader+0x110/0x110
> > [ 246.802799] torture_rtmutex_lock+0xc/0x10
> > [ 246.803545] lock_torture_writer+0x72/0x150
> > [ 246.804322] kthread+0x120/0x160
> > [ 246.804911] ? kthread_park+0x80/0x80
> > [ 246.805581] ret_from_fork+0x22/0x30
> > [ 246.806237] INFO: task lock_torture_wr:73 blocked for more than 122 seconds.
> > [ 246.807505] Not tainted 5.9.0-rc1+ #26
> > [ 246.808287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 246.809690] task:lock_torture_wr state:D stack:14696 pid: 73 ppid: 2 flags:0x00004000
> > [ 246.811208] Call Trace:
> > [ 246.811657] __schedule+0x414/0x6a0
> > [ 246.812306] schedule+0x41/0xe0
> > [ 246.812881] __rt_mutex_slowlock+0x49/0xd0
> > [ 246.813636] rt_mutex_slowlock+0xca/0x1e0
> > [ 246.814371] ? lock_torture_reader+0x110/0x110
> > [ 246.815182] torture_rtmutex_lock+0xc/0x10
> > [ 246.815923] lock_torture_writer+0x72/0x150
> > [ 246.816692] kthread+0x120/0x160
> > [ 246.817287] ? kthread_park+0x80/0x80
> > [ 246.817952] ret_from_fork+0x22/0x30
> >
> > Could this just be a side effect of the issue you are seeing? (A writer
> > acquired a lock but never got the CPU to release it, which in turn caused
> > lock acquirers to block in D-state indefinitely).
>
> It appears to me the reason could be that the higher-priority task is RT:
>
> sched_switch: prev_comm=lock_torture_wr prev_pid=74 prev_prio=139 prev_state=R+ ==> next_comm=lock_torture_wr next_pid=70 next_prio=49
>
> After this, only pid=70 runs until the hung-task detector screams.
>
> Could this be because the writer calls cur_ops->task_boost(), which sets pid=70
> to RT? As long as the RT task runs, it will block the CFS task without giving
> it the CPU.

Thank you for looking into this! Sounds like something I would do...
And unlike rcutorture, the timeframes are too short for throttling
to kick in. Seems like the fix is instead to set everything to
SCHED_OTHER while it is in torture_stutter(). I will give that a try.
Though perhaps the hrtimer sleep is better?

Thanx, Paul