2010-12-23 22:51:21

by Steven Rostedt

Subject: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

From: Steven Rostedt <[email protected]>

The commit: rtmutex: Optimize rt lock wakeup

Does not do what it was supposed to do.
This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
before going into the loop. Thus, the check in wakeup_next_waiter() that
tries to skip the wakeup always fails for an adaptive waiter: the pending
waiter never has its state set to TASK_RUNNING unless something else has
already woken it up.

The smp_mb() added to make this test work is just as expensive as
just calling the wakeup. And since we fail to skip the wakeup anyway, we
end up doing both the smp_mb() and the wakeup.
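
To make the failure mode concrete, the waiter side looks roughly like this
(an illustrative sketch only, not the actual rt.c code; got_lock() and
owner_on_cpu() are stand-ins for the real checks):

	/*
	 * Adaptive waiter, simplified: the state change happens before
	 * the spin, so the owner-side "is it already running?" test can
	 * never see TASK_RUNNING while we are spinning.
	 */
	set_current_state(TASK_UNINTERRUPTIBLE);

	while (!got_lock(lock)) {
		if (!owner_on_cpu(lock)) {
			/* owner was preempted: really go to sleep */
			schedule();
			set_current_state(TASK_UNINTERRUPTIBLE);
			continue;
		}
		cpu_relax();	/* spin, but still in TASK_UNINTERRUPTIBLE */
	}
	__set_current_state(TASK_RUNNING);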

I tested this with dbench, and we run faster without the optimization.
I also tried a variant that instead fixed the loop, changing the state
only when the spinner was actually about to go to sleep, and that still
did not show any improvement.

Cc: Gregory Haskins <[email protected]>
Cc: Peter Morreale <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/rtmutex.c | 29 ++---------------------------
1 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..e218873 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 */
 	if (!savestate)
 		wake_up_process(pendowner);
-	else {
-		/*
-		 * We can skip the actual (expensive) wakeup if the
-		 * waiter is already running, but we have to be careful
-		 * of race conditions because they may be about to sleep.
-		 *
-		 * The waiter-side protocol has the following pattern:
-		 * 1: Set state != RUNNING
-		 * 2: Conditionally sleep if waiter->task != NULL;
-		 *
-		 * And the owner-side has the following:
-		 * A: Set waiter->task = NULL
-		 * B: Conditionally wake if the state != RUNNING
-		 *
-		 * As long as we ensure 1->2 order, and A->B order, we
-		 * will never miss a wakeup.
-		 *
-		 * Therefore, this barrier ensures that waiter->task = NULL
-		 * is visible before we test the pendowner->state. The
-		 * corresponding barrier is in the sleep logic.
-		 */
-		smp_mb();
-
-		/* If !RUNNING && !RUNNING_MUTEX */
-		if (pendowner->state & ~TASK_RUNNING_MUTEX)
-			wake_up_process_mutex(pendowner);
-	}
+	else
+		wake_up_process_mutex(pendowner);
 
 	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
--
1.7.2.3


2010-12-24 04:45:46

by Gregory Haskins

Subject: Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

Hey Steve,

>>> On 12/23/2010 at 05:47 PM, in message <[email protected]>,
Steven Rostedt <[email protected]> wrote:
> From: Steven Rostedt <[email protected]>
>
> The commit: rtmutex: Optimize rt lock wakeup
>
> Does not do what it was supposed to do.
> This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> before going into the loop. Thus, the check in wakeup_next_waiter() that
> tries to skip the wakeup always fails for an adaptive waiter: the pending
> waiter never has its state set to TASK_RUNNING unless something else has
> already woken it up.
>
> The smp_mb() added to make this test work is just as expensive as
> just calling the wakeup. And since we fail to skip the wakeup anyway, we
> end up doing both the smp_mb() and the wakeup.
>
> I tested this with dbench and we run faster without this patch.
> I also tried a variant that instead fixed the loop, to change the state
> only if the spinner was to go to sleep, and that still did not show
> any improvement.

Just a quick note to say I am a bit skeptical of this patch. I know you are offline next week, so let's plan on hashing it out after the new year before I ack it.

Happy holidays!
-Greg

>
> Cc: Gregory Haskins <[email protected]>
> Cc: Peter Morreale <[email protected]>
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> kernel/rtmutex.c | 29 ++---------------------------
> 1 files changed, 2 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 318d7ed..e218873 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock,
> int savestate)
> */
> if (!savestate)
> wake_up_process(pendowner);
> - else {
> - /*
> - * We can skip the actual (expensive) wakeup if the
> - * waiter is already running, but we have to be careful
> - * of race conditions because they may be about to sleep.
> - *
> - * The waiter-side protocol has the following pattern:
> - * 1: Set state != RUNNING
> - * 2: Conditionally sleep if waiter->task != NULL;
> - *
> - * And the owner-side has the following:
> - * A: Set waiter->task = NULL
> - * B: Conditionally wake if the state != RUNNING
> - *
> - * As long as we ensure 1->2 order, and A->B order, we
> - * will never miss a wakeup.
> - *
> - * Therefore, this barrier ensures that waiter->task = NULL
> - * is visible before we test the pendowner->state. The
> - * corresponding barrier is in the sleep logic.
> - */
> - smp_mb();
> -
> - /* If !RUNNING && !RUNNING_MUTEX */
> - if (pendowner->state & ~TASK_RUNNING_MUTEX)
> - wake_up_process_mutex(pendowner);
> - }
> + else
> + wake_up_process_mutex(pendowner);
>
> rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
>



2010-12-24 04:54:27

by Steven Rostedt

Subject: Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
> Hey Steve,
>
> >>> On 12/23/2010 at 05:47 PM, in message <[email protected]>,
> Steven Rostedt <[email protected]> wrote:
> > From: Steven Rostedt <[email protected]>
> >
> > The commit: rtmutex: Optimize rt lock wakeup
> >
> > Does not do what it was supposed to do.
> > This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> > before going into the loop. Thus, the check in wakeup_next_waiter() that
> > tries to skip the wakeup always fails for an adaptive waiter: the pending
> > waiter never has its state set to TASK_RUNNING unless something else has
> > already woken it up.
> >
> > The smp_mb() added to make this test work is just as expensive as
> > just calling the wakeup. And since we fail to skip the wakeup anyway, we
> > end up doing both the smp_mb() and the wakeup.
> >
> > I tested this with dbench and we run faster without this patch.
> > I also tried a variant that instead fixed the loop, to change the state
> > only if the spinner was to go to sleep, and that still did not show
> > any improvement.
>
> Just a quick note to say I am a bit skeptical of this patch. I know you are offline next week, so lets plan on hashing it out after the new year before I ack it.

Sure, but as I said, it is mostly broken anyway. I could even insert
some tracepoints to show that this is always missed (heck I'll add an
unlikely and do the branch profiler ;-)
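
(For reference, the instrumentation hinted at above might look something like
this; purely illustrative, not an actual patch. With
CONFIG_PROFILE_ANNOTATED_BRANCHES enabled, the branch profiler's trace_stat
output would then show this branch being taken essentially 100% of the time,
i.e. the skip never triggers.)

	/* If !RUNNING && !RUNNING_MUTEX */
	if (unlikely(pendowner->state & ~TASK_RUNNING_MUTEX))
		wake_up_process_mutex(pendowner);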

The reason is that adaptive spinners spin in some other state than
TASK_RUNNING, thus it does not help adaptive spinners at all. I first
tried to fix that, but it made dbench run even slower. But I only did a
few tests, and only on a 4 CPU box, so it was a rather small sample. The
removal of the code had more to do with the fact that it was already broken
than anything else.

But yeah, we can hash this out in the new year. This is one of the
reasons I only posted this patch set as an RFC.

>
> Happy holidays!

You too!

-- Steve

2010-12-24 16:08:08

by Peter W. Morreale

Subject: Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
> Hey Steve,
>
> >>> On 12/23/2010 at 05:47 PM, in message <[email protected]>,
> Steven Rostedt <[email protected]> wrote:
> > From: Steven Rostedt <[email protected]>
> >
> > The commit: rtmutex: Optimize rt lock wakeup
> >
> > Does not do what it was supposed to do.
> > This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> > before going into the loop. Thus, the check in wakeup_next_waiter() that
> > tries to skip the wakeup always fails for an adaptive waiter: the pending
> > waiter never has its state set to TASK_RUNNING unless something else has
> > already woken it up.
> >
> > The smp_mb() added to make this test work is just as expensive as
> > just calling the wakeup. And since we fail to skip the wakeup anyway, we
> > end up doing both the smp_mb() and the wakeup.
> >
> > I tested this with dbench and we run faster without this patch.
> > I also tried a variant that instead fixed the loop, to change the state
> > only if the spinner was to go to sleep, and that still did not show
> > any improvement.
>
> Just a quick note to say I am a bit skeptical of this patch. I know you are offline next week, so lets plan on hashing it out after the new year before I ack it.
>

We shouldn't be too quick to merely rip this out without a little
thought. Clearly this is broken; the intent, however, was sound:
if a waiter is spinning, we don't need to wake it up.

The wake up path is substantially more than a barrier; it includes a
barrier as well as grabbing the task_rq_lock only to find that the task
is oncpu. Then various accounting is updated, etc.

We know definitively that a waiter can only spin if the owner is oncpu,
by definition of adaptive spinning. We also know that only the owner
can release the lock to a waiter (spinning or not). So it seems clear
that avoiding unnecessary contention on the rq lock would be a Good
Thing(tm).

Perhaps this cannot be done safely, but if you saw an improvement in
dbench by merely removing a barrier, imagine the improvement by removing
the contention on the lock.
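
To make the intended win concrete, the "fixed loop" variant Steve mentions
trying would look roughly like this (an illustrative sketch only; try_take()
and owner_on_cpu() are stand-ins for the real lock/owner checks):

	/*
	 * Waiter side: spin in TASK_RUNNING, and only go !RUNNING right
	 * before actually sleeping, so the owner-side check
	 * (pendowner->state & ~TASK_RUNNING_MUTEX) really can tell a
	 * spinner from a sleeper and skip the rq-lock/wakeup work.
	 */
	for (;;) {
		if (try_take(lock))
			break;			/* got the lock */

		if (owner_on_cpu(lock)) {
			cpu_relax();		/* adaptive spin, still RUNNING */
			continue;
		}

		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!try_take(lock))
			schedule();		/* a real sleep; a wakeup is needed */
		__set_current_state(TASK_RUNNING);
	}

Steve notes above that this variant did not help dbench either, but it is the
shape the skipped-wakeup check was presumably meant to exploit.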

Happy Holidays to all!

-PWM




> Happy holidays!
> -Greg
>
> >
> > Cc: Gregory Haskins <[email protected]>
> > Cc: Peter Morreale <[email protected]>
> > Signed-off-by: Steven Rostedt <[email protected]>
> > ---
> > kernel/rtmutex.c | 29 ++---------------------------
> > 1 files changed, 2 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> > index 318d7ed..e218873 100644
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock,
> > int savestate)
> > */
> > if (!savestate)
> > wake_up_process(pendowner);
> > - else {
> > - /*
> > - * We can skip the actual (expensive) wakeup if the
> > - * waiter is already running, but we have to be careful
> > - * of race conditions because they may be about to sleep.
> > - *
> > - * The waiter-side protocol has the following pattern:
> > - * 1: Set state != RUNNING
> > - * 2: Conditionally sleep if waiter->task != NULL;
> > - *
> > - * And the owner-side has the following:
> > - * A: Set waiter->task = NULL
> > - * B: Conditionally wake if the state != RUNNING
> > - *
> > - * As long as we ensure 1->2 order, and A->B order, we
> > - * will never miss a wakeup.
> > - *
> > - * Therefore, this barrier ensures that waiter->task = NULL
> > - * is visible before we test the pendowner->state. The
> > - * corresponding barrier is in the sleep logic.
> > - */
> > - smp_mb();
> > -
> > - /* If !RUNNING && !RUNNING_MUTEX */
> > - if (pendowner->state & ~TASK_RUNNING_MUTEX)
> > - wake_up_process_mutex(pendowner);
> > - }
> > + else
> > + wake_up_process_mutex(pendowner);
> >
> > rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
> >
>
>
>

2010-12-28 14:07:09

by Gregory Haskins

Subject: Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

>>> On 12/23/2010 at 11:54 PM, in message
<[email protected]>, Steven Rostedt
<[email protected]> wrote:
> On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
>> Hey Steve,
>>
>> >>> On 12/23/2010 at 05:47 PM, in message <[email protected]>,
>> Steven Rostedt <[email protected]> wrote:
>> > From: Steven Rostedt <[email protected]>
>> >
>> > The commit: rtmutex: Optimize rt lock wakeup
>> >
>> > Does not do what it was supposed to do.
>> > This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
>> > before going into the loop. Thus, the check in wakeup_next_waiter() that
>> > tries to skip the wakeup always fails for an adaptive waiter: the pending
>> > waiter never has its state set to TASK_RUNNING unless something else has
>> > already woken it up.
>> >
>> > The smp_mb() added to make this test work is just as expensive as
>> > just calling the wakeup. And since we fail to skip the wakeup anyway, we
>> > end up doing both the smp_mb() and the wakeup.
>> >
>> > I tested this with dbench and we run faster without this patch.
>> > I also tried a variant that instead fixed the loop, to change the state
>> > only if the spinner was to go to sleep, and that still did not show
>> > any improvement.
>>
>> Just a quick note to say I am a bit skeptical of this patch. I know you are
> offline next week, so lets plan on hashing it out after the new year before I
> ack it.
>
> Sure, but as I said, it is mostly broken anyway. I could even insert
> some tracepoints to show that this is always missed (heck I'll add an
> unlikely and do the branch profiler ;-)

Well, I think that would be a good datapoint and is one of the things I'd like to see.

>
> The reason is that adaptive spinners spin in some other state than
> TASK_RUNNING, thus it does not help adaptive spinners at all. I first
> tried to fix that, but it made dbench run even slower.

This is why I am skeptical. You are essentially asserting there are two issues here, IIUC:

1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
plus the wakeup() anyway.

2) mb() is apparently slower than wakeup().

I agree (1) is plausible, though I would like to see the traces to confirm. It's been a long time
since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
inadvertently broken in the meantime, or the other cpu would have transitioned to RUNNING on
its own when we flipped the owner before the release-side check was performed. Or perhaps
we just plain screwed this up and it was racy ;) I'm not sure. But as Peter (M) stated, it seems
like a shame to walk away from the concept without further investigation. I think everyone can
agree that at the very least, if it is in fact taking a double whammy we should fix that.
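
For reference, the ordering protocol the deleted comment describes pairs up
like this (an illustrative sketch, not the actual rt.c code):

	/* waiter (sleep path)              owner (release path)
	 *
	 *   state = !RUNNING;                waiter->task = NULL;
	 *   smp_mb();                        smp_mb();
	 *   if (waiter->task)                if (state & ~TASK_RUNNING_MUTEX)
	 *           schedule();                      wake_up_process_mutex(p);
	 *
	 * Each side's store is ordered before its load, so the two loads
	 * cannot both miss the other side's store, and a waiter that really
	 * sleeps is never left unwoken.  The gap Steve points at is
	 * different: a spinning waiter is already in a !RUNNING state, so
	 * the owner-side test never gets to skip the wakeup at all.
	 */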

For (2), I am skeptical in two parts ;). You stated you thought mb() was just as expensive as a
wakeup, which seems suspect to me, given that a wakeup needs to be a superset of a barrier
II[R|U]C. Let's call this "2a". In addition, your results when you removed the logic and went
straight to a wakeup() and found dbench actually was faster than the "fixed mb()" path would
imply wakeup() is actually _faster_ than mb(). Let's call this "2b".

For (2a), I would like to see some traces that compare mb() to wakeup() (of a presumably
already running task that happens to be in the INTERRUPTIBLE state) to be convinced that wakeup() is
equal/faster. I suspect it isn't.

For (2b), I would suggest that we don't rely on dbench alone in evaluating the merit of the
change. In some ways, it's a great test for this type of change since it leans heavily on the coarse
VFS locks. However, dbench is also pretty odd and thrives on somewhat chaotic behavior. For
instance, it loves the "lateral steal" logic, even though that logic technically breaks fairness. So
I would therefore propose that a suite of benchmarks known for creating as much lock contention as
possible be run in addition to dbench alone.

Happy new year, all,
-Greg