2014-04-03 23:43:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: Promela/spin model for NO_HZ_FULL_SYSIDLE code

On Sun, Mar 30, 2014 at 04:08:56PM -0700, Paul E. McKenney wrote:
> For whatever it is worth, the following model claims safety and progress
> for the sysidle state machine.
>
> Thoughts?

I'm going to get fun of myself by risking a review of this. Warning,
I don't speak promelian, so I may well write non-sense :)

>
> Thanx, Paul
>
> ------------------------------------------------------------------------
> sysidle.sh
> ------------------------------------------------------------------------
> spin -a sysidle.spin
> cc -DNP -o pan pan.c
> # Fair scheduling to focus progress checks in timekeeper.
> ./pan -f -l -m1280000 -w22
>
> ------------------------------------------------------------------------
> sysidle.spin
> ------------------------------------------------------------------------
> /*
> * Promela model for CONFIG_NO_HZ_FULL_SYSIDLE=y in the Linux kernel.
> * This model assumes that the dyntick-idle bit manipulation works based
> * on long usage, and substitutes a per-thread boolean "am_busy[]" array
> * for the Linux kernel's dyntick-idle masks. The focus of this model
> * is therefore on the state machine itself. Checks for both safety and
> * forward progress.
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> * (at your option) any later version.
> *
> * This program is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> * along with this program; if not, you can access it online at
> * http://www.gnu.org/licenses/gpl-2.0.html.
> *
> * Copyright IBM Corporation, 2014
> *
> * Author: Paul E. McKenney <[email protected]>
> */
>
> #define NUM_WORKERS 3
>
> byte wakeup_timekeeper = 0; /* Models rcu_kick_nohz_cpu(). */
>
> #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
> #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
> #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
> #define RCU_SYSIDLE_FULL 3 /* All CPUs idle, ready for sysidle. */
> #define RCU_SYSIDLE_FULL_NOTED 4 /* Actually entered sysidle state. */
>
> byte full_sysidle_state = RCU_SYSIDLE_NOT;
>
> byte am_busy[NUM_WORKERS]; /* Busy is similar to "not dyntick-idle". */
> byte am_setup[NUM_WORKERS]; /* Setup means timekeeper knows I am not idle. */
>
> /*
> * Non-timekeeping CPU going into and out of dyntick-idle state.
> */
> proctype worker(byte me)
> {
> byte oldstate;
>
> do
> :: 1 ->
> /* Go idle. */
> am_setup[me] = 0;
> am_busy[me] = 0;
>
> /* Dyntick-idle in the following loop. */
> do
> :: 1 -> skip;
> :: 1 -> break;
> od;
>
> /* Exit idle loop, model getting out of dyntick idle state. */
> am_busy[me] = 1;
>
> /* Get state out of full-system idle states. */
> atomic {
> oldstate = full_sysidle_state;

On the upstream code, the first read of full_sysidle_state after exiting idle is not
performed by an atomic operation. So I wonder if this is right to put this
in the atomic section.

I don't know the language enough to tell if it has no effect but I'm just
worried that it gets badly intepreted. Like the above read plus the below
conditional write in the same atomic section gets packed in a kind of cmpxchg_if_above() ?

Which is what we want to avoid if the value is not above RCU_SYSIDLE_SHORT after
a non atomic read.

> if
> :: oldstate > RCU_SYSIDLE_SHORT ->
> full_sysidle_state = RCU_SYSIDLE_NOT;
> :: else -> skip;
> fi;
> }
>
> /* If needed, wake up the timekeeper. */
> if
> :: oldstate == RCU_SYSIDLE_FULL_NOTED ->
> wakeup_timekeeper = 1;
> :: else -> skip;
> fi;
>
> /* Mark ourselves fully awake and operational. */
> am_setup[me] = 1;
>
> /* We are fully awake, so timekeeper must not be asleep. */
> assert(full_sysidle_state < RCU_SYSIDLE_FULL);
>
> /* Running in kernel in the following loop. */
> do
> :: 1 -> skip;
> :: 1 -> break;
> od;
> od
> }
>
> /*
> * Are all the workers in dyntick-idle state?
> */
> #define check_idle() \
> i = 0; \
> idle = 1; \
> do \
> :: i < NUM_WORKERS -> \
> if \
> :: am_busy[i] == 1 -> idle = 0; \
> :: else -> skip; \
> fi; \
> i++; \
> :: i >= NUM_WORKERS -> break; \
> od
>
> /*
> * Timekeeping CPU.
> */
> proctype timekeeper()
> {
> byte i;
> byte idle;
> byte curstate;
> byte newstate;
>
> do
> :: 1 ->
> /* Capture current state. */
> check_idle();
> curstate = full_sysidle_state;
> newstate = curstate;
>
> /* Check for acceptance state. */
> if
> :: idle == 0 ->
> progress_idle:

Is this some kind of label? I can't find the target anywhere.

> skip;
> :: curstate == RCU_SYSIDLE_NOT ->
> progress_idle_reset:
> skip;
> :: else -> skip;
> fi;
>
> /* Manage state... */
> if
> :: idle == 1 && curstate < RCU_SYSIDLE_FULL_NOTED ->
> /* Idle, advance to next state. */
> atomic {
> if
> :: full_sysidle_state == curstate ->
> newstate = curstate + 1;
> full_sysidle_state = newstate;
> :: else -> skip;
> fi;
> }

It looks good but just one thing about the transition from FULL -> FULL_NOTED.
At least in the case of CONFIG_NO_HZ_FULL_SYSIDLE_SMALL (which is usually the
scenario I refer to, but I'll check further the grace-period driven way as well),
we switch from FULL to FULL_NOTED without checking a new round of the dynticks counters.

But this timekeeper() proc doesn't seem to care and does a check_idle() no matter
the current state.

There should probably be a special case to handle that otherwise we add a new
round of dynticks counters read between FULL and FULL_NOTED transition and this is an
entirely different scenario than what we run.

> :: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
> /* Non-idle and state advanced, revert to base state. */
> full_sysidle_state = RCU_SYSIDLE_NOT;

Looking at the upstream code, I think we reset also when state == RCU_SYSIDLE_SHORT
once we detect a non-idle state. If it's not a mistyping, I'm probably missing something.

Thanks.

> :: else -> skip;
> fi;
>
> /* If in RCU_SYSIDLE_FULL_NOTED, wait to be awakened. */
> do
> :: newstate != RCU_SYSIDLE_FULL_NOTED &&
> wakeup_timekeeper == 1 ->
> assert(0); /* Should never get here. */
> :: newstate != RCU_SYSIDLE_FULL_NOTED &&
> wakeup_timekeeper == 0 ->
> break;
> :: newstate == RCU_SYSIDLE_FULL_NOTED &&
> wakeup_timekeeper == 1 ->
> progress_full_system_idle_1:
> assert(full_sysidle_state == RCU_SYSIDLE_NOT);
> wakeup_timekeeper = 0;
> break;
> :: newstate == RCU_SYSIDLE_FULL_NOTED &&
> wakeup_timekeeper == 0 ->
> progress_full_system_idle_2:
>
> /* We are asleep, so all workers better be idle. */
> atomic {
> i = 0;
> idle = 1;
> do
> :: i < NUM_WORKERS ->
> if
> :: am_setup[i] -> idle = 0;
> :: else -> skip;
> fi;
> i++;
> :: i >= NUM_WORKERS -> break;
> od;
> assert(idle == 1 ||
> full_sysidle_state < RCU_SYSIDLE_FULL);
> }
> od;
> assert(full_sysidle_state <= RCU_SYSIDLE_FULL_NOTED);
> od;
> }
>
> init {
> byte i = 0;
>
> do
> :: i < NUM_WORKERS ->
> am_busy[i] = 1;
> am_setup[i] = 1;
> run worker(i);
> i++;
> :: i >= NUM_WORKERS -> break;
> od;
> run timekeeper();
> }
>


2014-04-07 18:12:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Promela/spin model for NO_HZ_FULL_SYSIDLE code

On Fri, Apr 04, 2014 at 01:43:16AM +0200, Frederic Weisbecker wrote:
> On Sun, Mar 30, 2014 at 04:08:56PM -0700, Paul E. McKenney wrote:
> > For whatever it is worth, the following model claims safety and progress
> > for the sysidle state machine.
> >
> > Thoughts?
>
> I'm going to get fun of myself by risking a review of this. Warning,
> I don't speak promelian, so I may well write non-sense :)

Actually, you did find one real mismatch and one arguable one. ;-)

> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> > sysidle.sh
> > ------------------------------------------------------------------------
> > spin -a sysidle.spin
> > cc -DNP -o pan pan.c
> > # Fair scheduling to focus progress checks in timekeeper.
> > ./pan -f -l -m1280000 -w22
> >
> > ------------------------------------------------------------------------
> > sysidle.spin
> > ------------------------------------------------------------------------
> > /*
> > * Promela model for CONFIG_NO_HZ_FULL_SYSIDLE=y in the Linux kernel.
> > * This model assumes that the dyntick-idle bit manipulation works based
> > * on long usage, and substitutes a per-thread boolean "am_busy[]" array
> > * for the Linux kernel's dyntick-idle masks. The focus of this model
> > * is therefore on the state machine itself. Checks for both safety and
> > * forward progress.
> > *
> > * This program is free software; you can redistribute it and/or modify
> > * it under the terms of the GNU General Public License as published by
> > * the Free Software Foundation; either version 2 of the License, or
> > * (at your option) any later version.
> > *
> > * This program is distributed in the hope that it will be useful,
> > * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > * GNU General Public License for more details.
> > *
> > * You should have received a copy of the GNU General Public License
> > * along with this program; if not, you can access it online at
> > * http://www.gnu.org/licenses/gpl-2.0.html.
> > *
> > * Copyright IBM Corporation, 2014
> > *
> > * Author: Paul E. McKenney <[email protected]>
> > */
> >
> > #define NUM_WORKERS 3
> >
> > byte wakeup_timekeeper = 0; /* Models rcu_kick_nohz_cpu(). */
> >
> > #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
> > #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
> > #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
> > #define RCU_SYSIDLE_FULL 3 /* All CPUs idle, ready for sysidle. */
> > #define RCU_SYSIDLE_FULL_NOTED 4 /* Actually entered sysidle state. */
> >
> > byte full_sysidle_state = RCU_SYSIDLE_NOT;
> >
> > byte am_busy[NUM_WORKERS]; /* Busy is similar to "not dyntick-idle". */
> > byte am_setup[NUM_WORKERS]; /* Setup means timekeeper knows I am not idle. */
> >
> > /*
> > * Non-timekeeping CPU going into and out of dyntick-idle state.
> > */
> > proctype worker(byte me)
> > {
> > byte oldstate;
> >
> > do
> > :: 1 ->
> > /* Go idle. */
> > am_setup[me] = 0;
> > am_busy[me] = 0;
> >
> > /* Dyntick-idle in the following loop. */
> > do
> > :: 1 -> skip;
> > :: 1 -> break;
> > od;
> >
> > /* Exit idle loop, model getting out of dyntick idle state. */
> > am_busy[me] = 1;
> >
> > /* Get state out of full-system idle states. */
> > atomic {
> > oldstate = full_sysidle_state;
>
> On the upstream code, the first read of full_sysidle_state after exiting idle is not
> performed by an atomic operation. So I wonder if this is right to put this
> in the atomic section.
>
> I don't know the language enough to tell if it has no effect but I'm just
> worried that it gets badly intepreted. Like the above read plus the below
> conditional write in the same atomic section gets packed in a kind of cmpxchg_if_above() ?
>
> Which is what we want to avoid if the value is not above RCU_SYSIDLE_SHORT after
> a non atomic read.

Given that cmpxchg() is being used to emulate exactly that atomic
operation, I feel good about this Promela translation. If the value is
different at the time of the cmpxchg(), the cmpxchg() fails. I suppose
I could write it as follows instead:

/* Get state out of full-system idle states. */
oldstate = full_sysidle_state;
do
:: 1 ->
atomic {
if
:: oldstate > RCU_SYSIDLE_SHORT &&
oldstate == full_sysidle_state ->
full_sysidle_state = RCU_SYSIDLE_NOT;
break;
:: else ->
oldstate = full_sysidle_state;
fi;
}
od;

Here the "if" emulates the cmpxchg() instruction and the rest emulates
the loop. Both representations get the same result when

> > if
> > :: oldstate > RCU_SYSIDLE_SHORT ->
> > full_sysidle_state = RCU_SYSIDLE_NOT;
> > :: else -> skip;
> > fi;
> > }
> >
> > /* If needed, wake up the timekeeper. */
> > if
> > :: oldstate == RCU_SYSIDLE_FULL_NOTED ->
> > wakeup_timekeeper = 1;
> > :: else -> skip;
> > fi;
> >
> > /* Mark ourselves fully awake and operational. */
> > am_setup[me] = 1;
> >
> > /* We are fully awake, so timekeeper must not be asleep. */
> > assert(full_sysidle_state < RCU_SYSIDLE_FULL);
> >
> > /* Running in kernel in the following loop. */
> > do
> > :: 1 -> skip;
> > :: 1 -> break;
> > od;
> > od
> > }
> >
> > /*
> > * Are all the workers in dyntick-idle state?
> > */
> > #define check_idle() \
> > i = 0; \
> > idle = 1; \
> > do \
> > :: i < NUM_WORKERS -> \
> > if \
> > :: am_busy[i] == 1 -> idle = 0; \
> > :: else -> skip; \
> > fi; \
> > i++; \
> > :: i >= NUM_WORKERS -> break; \
> > od
> >
> > /*
> > * Timekeeping CPU.
> > */
> > proctype timekeeper()
> > {
> > byte i;
> > byte idle;
> > byte curstate;
> > byte newstate;
> >
> > do
> > :: 1 ->
> > /* Capture current state. */
> > check_idle();
> > curstate = full_sysidle_state;
> > newstate = curstate;
> >
> > /* Check for acceptance state. */
> > if
> > :: idle == 0 ->
> > progress_idle:
>
> Is this some kind of label? I can't find the target anywhere.

It is a marker. If you specify -DNP and if there is any cycle of
states that does not pass through a label beginning with "progress",
the verification will fail. So it is useful for finding livelocks.

Mathieu posted another way of getting this same effect.

> > skip;
> > :: curstate == RCU_SYSIDLE_NOT ->
> > progress_idle_reset:
> > skip;
> > :: else -> skip;
> > fi;
> >
> > /* Manage state... */
> > if
> > :: idle == 1 && curstate < RCU_SYSIDLE_FULL_NOTED ->
> > /* Idle, advance to next state. */
> > atomic {
> > if
> > :: full_sysidle_state == curstate ->
> > newstate = curstate + 1;
> > full_sysidle_state = newstate;
> > :: else -> skip;
> > fi;
> > }
>
> It looks good but just one thing about the transition from FULL -> FULL_NOTED.
> At least in the case of CONFIG_NO_HZ_FULL_SYSIDLE_SMALL (which is usually the
> scenario I refer to, but I'll check further the grace-period driven way as well),
> we switch from FULL to FULL_NOTED without checking a new round of the dynticks counters.
>
> But this timekeeper() proc doesn't seem to care and does a check_idle() no matter
> the current state.
>
> There should probably be a special case to handle that otherwise we add a new
> round of dynticks counters read between FULL and FULL_NOTED transition and this is an
> entirely different scenario than what we run.

Good catch! I changed the above RCU_SYSIDLE_FULL_NOTED to RCU_SYSIDLE_FULL
and added an atomic block to move to RCU_SYSIDLE_FULL_NOTED. Still verifies
(whew!).

> > :: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
> > /* Non-idle and state advanced, revert to base state. */
> > full_sysidle_state = RCU_SYSIDLE_NOT;
>
> Looking at the upstream code, I think we reset also when state == RCU_SYSIDLE_SHORT
> once we detect a non-idle state. If it's not a mistyping, I'm probably missing something.

I don't see this. The resetting happens in rcu_sysidle_force_exit(),
which contains the following:

while (oldstate > RCU_SYSIDLE_SHORT) {
newoldstate = cmpxchg(&full_sysidle_state,
oldstate, RCU_SYSIDLE_NOT);
if (oldstate == newoldstate &&
oldstate == RCU_SYSIDLE_FULL_NOTED) {
rcu_kick_nohz_cpu(tick_do_timer_cpu);
return; /* We cleared it, done! */
}
oldstate = newoldstate;
}

If the state is RCU_SYSIDLE_SHORT, we skip the body of the "if" thus
declining to reset back to RCU_SYSIDLE_NOT. Or am I confused?

Thanx, Paul

> Thanks.
>
> > :: else -> skip;
> > fi;
> >
> > /* If in RCU_SYSIDLE_FULL_NOTED, wait to be awakened. */
> > do
> > :: newstate != RCU_SYSIDLE_FULL_NOTED &&
> > wakeup_timekeeper == 1 ->
> > assert(0); /* Should never get here. */
> > :: newstate != RCU_SYSIDLE_FULL_NOTED &&
> > wakeup_timekeeper == 0 ->
> > break;
> > :: newstate == RCU_SYSIDLE_FULL_NOTED &&
> > wakeup_timekeeper == 1 ->
> > progress_full_system_idle_1:
> > assert(full_sysidle_state == RCU_SYSIDLE_NOT);
> > wakeup_timekeeper = 0;
> > break;
> > :: newstate == RCU_SYSIDLE_FULL_NOTED &&
> > wakeup_timekeeper == 0 ->
> > progress_full_system_idle_2:
> >
> > /* We are asleep, so all workers better be idle. */
> > atomic {
> > i = 0;
> > idle = 1;
> > do
> > :: i < NUM_WORKERS ->
> > if
> > :: am_setup[i] -> idle = 0;
> > :: else -> skip;
> > fi;
> > i++;
> > :: i >= NUM_WORKERS -> break;
> > od;
> > assert(idle == 1 ||
> > full_sysidle_state < RCU_SYSIDLE_FULL);
> > }
> > od;
> > assert(full_sysidle_state <= RCU_SYSIDLE_FULL_NOTED);
> > od;
> > }
> >
> > init {
> > byte i = 0;
> >
> > do
> > :: i < NUM_WORKERS ->
> > am_busy[i] = 1;
> > am_setup[i] = 1;
> > run worker(i);
> > i++;
> > :: i >= NUM_WORKERS -> break;
> > od;
> > run timekeeper();
> > }
> >
>

2014-04-07 18:54:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: Promela/spin model for NO_HZ_FULL_SYSIDLE code

On Mon, Apr 07, 2014 at 11:11:55AM -0700, Paul E. McKenney wrote:
> > On the upstream code, the first read of full_sysidle_state after exiting idle is not
> > performed by an atomic operation. So I wonder if this is right to put this
> > in the atomic section.
> >
> > I don't know the language enough to tell if it has no effect but I'm just
> > worried that it gets badly intepreted. Like the above read plus the below
> > conditional write in the same atomic section gets packed in a kind of cmpxchg_if_above() ?
> >
> > Which is what we want to avoid if the value is not above RCU_SYSIDLE_SHORT after
> > a non atomic read.
>
> Given that cmpxchg() is being used to emulate exactly that atomic
> operation, I feel good about this Promela translation.

Right but the very first read that checks if full_sysidle_state is > RCU_SYSIDLE_SHORT
is done in a non-atomic way. Only when it verifies the condition do we use cmpxchg()

while (oldstate > RCU_SYSIDLE_SHORT) {
newoldstate = cmpxchg(&full_sysidle_state,
oldstate, RCU_SYSIDLE_NOT);
if (oldstate == newoldstate && oldstate == RCU_SYSIDLE_FULL_NOTED) {
rcu_kick_nohz_cpu(tick_do_timer_cpu);
return; /* We cleared it, done! */
}
oldstate = newoldstate;
}

Now the way you wrote it PROMELA looked like we use cmpxchg() (or some close relative)
right away from the first read. It's hard to translate with real world operations so I'm
inventing cmpxchg_if_above() which has the same atomic and full barrier properties
than cmpxchg() expect that it only exchanges old value with new value if old is
above the last parameter:

oldstate = cmpxchg_if_above(oldstate, RCU_SYSIDLE_NOT, RCU_SYSIDLE_SHORT);
if (oldstate == RCU_SYSIDLE_FULL_NOTED)
rcu_kick_nohz_cpu(tick_do_timer_cpu);
return; /* We cleared it, done! */


> If the value is different at the time of the cmpxchg(), the cmpxchg() fails.

Right but it might not do a cmpxchg() if oldval is <= SHORT.

> I suppose I could write it as follows instead:
>
> /* Get state out of full-system idle states. */
> oldstate = full_sysidle_state;
> do
> :: 1 ->
> atomic {
> if
> :: oldstate > RCU_SYSIDLE_SHORT &&
> oldstate == full_sysidle_state ->
> full_sysidle_state = RCU_SYSIDLE_NOT;
> break;
> :: else ->
> oldstate = full_sysidle_state;
> fi;
> }
> od;
>
> Here the "if" emulates the cmpxchg() instruction and the rest emulates
> the loop. Both representations get the same result when

Ok if they have the same result and implement the first read in a non atomic
way then it's all good. The PROMELA syntax has just been confusing to me in
this regard.

>
> > > if
> > > :: oldstate > RCU_SYSIDLE_SHORT ->
> > > full_sysidle_state = RCU_SYSIDLE_NOT;
> > > :: else -> skip;
> > > fi;
> > > }
> > >
> > > /* If needed, wake up the timekeeper. */
> > > if
> > > :: oldstate == RCU_SYSIDLE_FULL_NOTED ->
> > > wakeup_timekeeper = 1;
> > > :: else -> skip;
> > > fi;
> > >
> > > /* Mark ourselves fully awake and operational. */
> > > am_setup[me] = 1;
> > >
> > > /* We are fully awake, so timekeeper must not be asleep. */
> > > assert(full_sysidle_state < RCU_SYSIDLE_FULL);
> > >
> > > /* Running in kernel in the following loop. */
> > > do
> > > :: 1 -> skip;
> > > :: 1 -> break;
> > > od;
> > > od
> > > }
> > >
> > > /*
> > > * Are all the workers in dyntick-idle state?
> > > */
> > > #define check_idle() \
> > > i = 0; \
> > > idle = 1; \
> > > do \
> > > :: i < NUM_WORKERS -> \
> > > if \
> > > :: am_busy[i] == 1 -> idle = 0; \
> > > :: else -> skip; \
> > > fi; \
> > > i++; \
> > > :: i >= NUM_WORKERS -> break; \
> > > od
> > >
> > > /*
> > > * Timekeeping CPU.
> > > */
> > > proctype timekeeper()
> > > {
> > > byte i;
> > > byte idle;
> > > byte curstate;
> > > byte newstate;
> > >
> > > do
> > > :: 1 ->
> > > /* Capture current state. */
> > > check_idle();
> > > curstate = full_sysidle_state;
> > > newstate = curstate;
> > >
> > > /* Check for acceptance state. */
> > > if
> > > :: idle == 0 ->
> > > progress_idle:
> >
> > Is this some kind of label? I can't find the target anywhere.
>
> It is a marker. If you specify -DNP and if there is any cycle of
> states that does not pass through a label beginning with "progress",
> the verification will fail. So it is useful for finding livelocks.
>
> Mathieu posted another way of getting this same effect.

Ah ok.

>
> > > :: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
> > > /* Non-idle and state advanced, revert to base state. */
> > > full_sysidle_state = RCU_SYSIDLE_NOT;
> >
> > Looking at the upstream code, I think we reset also when state == RCU_SYSIDLE_SHORT
> > once we detect a non-idle state. If it's not a mistyping, I'm probably missing something.
>
> I don't see this. The resetting happens in rcu_sysidle_force_exit(),
> which contains the following:
>
> while (oldstate > RCU_SYSIDLE_SHORT) {
> newoldstate = cmpxchg(&full_sysidle_state,
> oldstate, RCU_SYSIDLE_NOT);
> if (oldstate == newoldstate &&
> oldstate == RCU_SYSIDLE_FULL_NOTED) {
> rcu_kick_nohz_cpu(tick_do_timer_cpu);
> return; /* We cleared it, done! */
> }
> oldstate = newoldstate;
> }
>
> If the state is RCU_SYSIDLE_SHORT, we skip the body of the "if" thus
> declining to reset back to RCU_SYSIDLE_NOT. Or am I confused?

Hmm, the loop above is the code of !timekeeper side. I'm referring to the timekeeper side (which
is what this PROMELA chunck represents, unless I'm utterly confused).

And the timekeeper side resets full_sysidle_state when it detects a non-idle CPU:

rcu_sys_is_idle() -> rcu_sysidle_report -> rcu_sysidle_cancel()

So the PROMELA code suggests that when we find a non idle CPU, we only reset
full_sysidle_state when is it >= RCU_SYSIDLE_LONG:

:: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
/* Non-idle and state advanced, revert to base state. */
full_sysidle_state = RCU_SYSIDLE_NOT;

...but actually the condition upstream seems to be full_sysidle_state >= RCU_SYSIDLE_SHORT.

For example we loop in rcu_sys_is_idle() (case of NR_CPUS < 8), and we detect a round of full-idle CPUs, so we
set full_sysidle_state to RCU_SYSIDLE_SHORT. Then we do another pass in the loop but this
time we detect a CPU is not idle, so we reset to RCU_SYSIDLE_NOT instead of advancing to RCU_SYSIDLE_LONG.

Right?

Thanks.

2014-04-07 19:54:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Promela/spin model for NO_HZ_FULL_SYSIDLE code

On Mon, Apr 07, 2014 at 08:54:52PM +0200, Frederic Weisbecker wrote:
> On Mon, Apr 07, 2014 at 11:11:55AM -0700, Paul E. McKenney wrote:
> > > On the upstream code, the first read of full_sysidle_state after exiting idle is not
> > > performed by an atomic operation. So I wonder if this is right to put this
> > > in the atomic section.
> > >
> > > I don't know the language enough to tell if it has no effect but I'm just
> > > worried that it gets badly intepreted. Like the above read plus the below
> > > conditional write in the same atomic section gets packed in a kind of cmpxchg_if_above() ?
> > >
> > > Which is what we want to avoid if the value is not above RCU_SYSIDLE_SHORT after
> > > a non atomic read.
> >
> > Given that cmpxchg() is being used to emulate exactly that atomic
> > operation, I feel good about this Promela translation.
>
> Right but the very first read that checks if full_sysidle_state is > RCU_SYSIDLE_SHORT
> is done in a non-atomic way. Only when it verifies the condition do we use cmpxchg()
>
> while (oldstate > RCU_SYSIDLE_SHORT) {
> newoldstate = cmpxchg(&full_sysidle_state,
> oldstate, RCU_SYSIDLE_NOT);
> if (oldstate == newoldstate && oldstate == RCU_SYSIDLE_FULL_NOTED) {
> rcu_kick_nohz_cpu(tick_do_timer_cpu);
> return; /* We cleared it, done! */
> }
> oldstate = newoldstate;
> }
>
> Now the way you wrote it PROMELA looked like we use cmpxchg() (or some close relative)
> right away from the first read. It's hard to translate with real world operations so I'm
> inventing cmpxchg_if_above() which has the same atomic and full barrier properties
> than cmpxchg() expect that it only exchanges old value with new value if old is
> above the last parameter:
>
> oldstate = cmpxchg_if_above(oldstate, RCU_SYSIDLE_NOT, RCU_SYSIDLE_SHORT);
> if (oldstate == RCU_SYSIDLE_FULL_NOTED)
> rcu_kick_nohz_cpu(tick_do_timer_cpu);
> return; /* We cleared it, done! */
>
>
> > If the value is different at the time of the cmpxchg(), the cmpxchg() fails.
>
> Right but it might not do a cmpxchg() if oldval is <= SHORT.
>
> > I suppose I could write it as follows instead:
> >
> > /* Get state out of full-system idle states. */
> > oldstate = full_sysidle_state;
> > do
> > :: 1 ->
> > atomic {
> > if
> > :: oldstate > RCU_SYSIDLE_SHORT &&
> > oldstate == full_sysidle_state ->
> > full_sysidle_state = RCU_SYSIDLE_NOT;
> > break;
> > :: else ->
> > oldstate = full_sysidle_state;
> > fi;
> > }
> > od;
> >
> > Here the "if" emulates the cmpxchg() instruction and the rest emulates
> > the loop. Both representations get the same result when
>
> Ok if they have the same result and implement the first read in a non atomic
> way then it's all good. The PROMELA syntax has just been confusing to me in
> this regard.

Actually, my first attempt above wasn't quite right. This one is better.
So, yes, Promela can be confusing. ;-)

/* Get state out of full-system idle states. */
oldstate = full_sysidle_state;
do
:: 1 ->
atomic {
if
:: oldstate > RCU_SYSIDLE_SHORT &&
oldstate == full_sysidle_state ->
full_sysidle_state = RCU_SYSIDLE_NOT;
break;
:: oldstate > RCU_SYSIDLE_SHORT &&
oldstate != full_sysidle_state ->
oldstate = full_sysidle_state;
:: oldstate <= RCU_SYSIDLE_SHORT -> break;
fi;
}
od;

And this passes as well.

> > > > if
> > > > :: oldstate > RCU_SYSIDLE_SHORT ->
> > > > full_sysidle_state = RCU_SYSIDLE_NOT;
> > > > :: else -> skip;
> > > > fi;
> > > > }
> > > >
> > > > /* If needed, wake up the timekeeper. */
> > > > if
> > > > :: oldstate == RCU_SYSIDLE_FULL_NOTED ->
> > > > wakeup_timekeeper = 1;
> > > > :: else -> skip;
> > > > fi;
> > > >
> > > > /* Mark ourselves fully awake and operational. */
> > > > am_setup[me] = 1;
> > > >
> > > > /* We are fully awake, so timekeeper must not be asleep. */
> > > > assert(full_sysidle_state < RCU_SYSIDLE_FULL);
> > > >
> > > > /* Running in kernel in the following loop. */
> > > > do
> > > > :: 1 -> skip;
> > > > :: 1 -> break;
> > > > od;
> > > > od
> > > > }
> > > >
> > > > /*
> > > > * Are all the workers in dyntick-idle state?
> > > > */
> > > > #define check_idle() \
> > > > i = 0; \
> > > > idle = 1; \
> > > > do \
> > > > :: i < NUM_WORKERS -> \
> > > > if \
> > > > :: am_busy[i] == 1 -> idle = 0; \
> > > > :: else -> skip; \
> > > > fi; \
> > > > i++; \
> > > > :: i >= NUM_WORKERS -> break; \
> > > > od
> > > >
> > > > /*
> > > > * Timekeeping CPU.
> > > > */
> > > > proctype timekeeper()
> > > > {
> > > > byte i;
> > > > byte idle;
> > > > byte curstate;
> > > > byte newstate;
> > > >
> > > > do
> > > > :: 1 ->
> > > > /* Capture current state. */
> > > > check_idle();
> > > > curstate = full_sysidle_state;
> > > > newstate = curstate;
> > > >
> > > > /* Check for acceptance state. */
> > > > if
> > > > :: idle == 0 ->
> > > > progress_idle:
> > >
> > > Is this some kind of label? I can't find the target anywhere.
> >
> > It is a marker. If you specify -DNP and if there is any cycle of
> > states that does not pass through a label beginning with "progress",
> > the verification will fail. So it is useful for finding livelocks.
> >
> > Mathieu posted another way of getting this same effect.
>
> Ah ok.
>
> >
> > > > :: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
> > > > /* Non-idle and state advanced, revert to base state. */
> > > > full_sysidle_state = RCU_SYSIDLE_NOT;
> > >
> > > Looking at the upstream code, I think we reset also when state == RCU_SYSIDLE_SHORT
> > > once we detect a non-idle state. If it's not a mistyping, I'm probably missing something.
> >
> > I don't see this. The resetting happens in rcu_sysidle_force_exit(),
> > which contains the following:
> >
> > while (oldstate > RCU_SYSIDLE_SHORT) {
> > newoldstate = cmpxchg(&full_sysidle_state,
> > oldstate, RCU_SYSIDLE_NOT);
> > if (oldstate == newoldstate &&
> > oldstate == RCU_SYSIDLE_FULL_NOTED) {
> > rcu_kick_nohz_cpu(tick_do_timer_cpu);
> > return; /* We cleared it, done! */
> > }
> > oldstate = newoldstate;
> > }
> >
> > If the state is RCU_SYSIDLE_SHORT, we skip the body of the "if" thus
> > declining to reset back to RCU_SYSIDLE_NOT. Or am I confused?
>
> Hmm, the loop above is the code of !timekeeper side. I'm referring to the timekeeper side (which
> is what this PROMELA chunck represents, unless I'm utterly confused).

Yep, we are in the timekeeper part of the model.

> And the timekeeper side resets full_sysidle_state when it detects a non-idle CPU:
>
> rcu_sys_is_idle() -> rcu_sysidle_report -> rcu_sysidle_cancel()
>
> So the PROMELA code suggests that when we find a non idle CPU, we only reset
> full_sysidle_state when is it >= RCU_SYSIDLE_LONG:
>
> :: idle == 0 && full_sysidle_state >= RCU_SYSIDLE_LONG ->
> /* Non-idle and state advanced, revert to base state. */
> full_sysidle_state = RCU_SYSIDLE_NOT;
>
> ...but actually the condition upstream seems to be full_sysidle_state >= RCU_SYSIDLE_SHORT.
>
> For example we loop in rcu_sys_is_idle() (case of NR_CPUS < 8), and we detect a round of full-idle CPUs, so we
> set full_sysidle_state to RCU_SYSIDLE_SHORT. Then we do another pass in the loop but this
> time we detect a CPU is not idle, so we reset to RCU_SYSIDLE_NOT instead of advancing to RCU_SYSIDLE_LONG.
>
> Right?

So rcu_sys_is_idle() invokes rcu_sysidle_report() after each scan.
If rcu_sysidle_report() sees that there was a nonidle CPU, it invokes
rcu_sysidle_cancel(). Which does indeed set the state back to
RCU_SYSIDLE_NOT, as you say. Who wrote this code, anyway? ;-)

IIRC, the rationale was that hitting the state hard and often on
small systems should not be a problem. But there will probably
soon be some system out there that will not like this sort of
treatment.

I am tempted to modify the kernel to make it easier to model by putting a
check for full_sysidle_state >= RCU_SYSIDLE_LONG in rcu_sysidle_cancel(),
and I might at some point do just that. For the moment, I just made
the model check both ways. It does work either way. (Whew!)

Seems like it would be simpler if the state machines for both large
and small systems operated the same, though. And the extra check
certainly shouldn't hurt small systems. So I will change the kernel!

Thanx, Paul