2024-01-27 22:54:58

by Frederic Weisbecker

Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
> +static bool tmigr_inactive_up(struct tmigr_group *group,
> +			      struct tmigr_group *child,
> +			      void *ptr)
> +{
> +	union tmigr_state curstate, newstate, childstate;
> +	struct tmigr_walk *data = ptr;
> +	bool walk_done;
> +	u8 childmask;
> +
> +	childmask = data->childmask;
> +	curstate.state = atomic_read(&group->migr_state);
> +	childstate.state = 0;
> +
> +	do {

So I got the confirmation from Boqun (+Cc) and Paul that a failing cmpxchg
may not order the load of the old value against subsequent loads. And
that may apply to atomic_try_cmpxchg() as well.

Therefore you not only need to turn group->migr_state read into
an atomic_read_acquire() but you also need to do this on each iteration
of this loop. For example you can move the read_acquire right here.

Thanks.

> +		if (child)
> +			childstate.state = atomic_read(&child->migr_state);
> +
> +		newstate = curstate;
> +		walk_done = true;
> +
> +		/* Reset active bit when the child is no longer active */
> +		if (!childstate.active)
> +			newstate.active &= ~childmask;
> +
> +		if (newstate.migrator == childmask) {
> +			/*
> +			 * Find a new migrator for the group, because the child
> +			 * group is idle!
> +			 */
> +			if (!childstate.active) {
> +				unsigned long new_migr_bit, active = newstate.active;
> +
> +				new_migr_bit = find_first_bit(&active, BIT_CNT);
> +
> +				if (new_migr_bit != BIT_CNT) {
> +					newstate.migrator = BIT(new_migr_bit);
> +				} else {
> +					newstate.migrator = TMIGR_NONE;
> +
> +					/* Changes need to be propagated */
> +					walk_done = false;
> +				}
> +			}
> +		}
> +
> +		newstate.seq++;
> +
> +		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> +
> +	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));


2024-01-29 10:50:51

by Anna-Maria Behnsen

Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Frederic Weisbecker <[email protected]> writes:

> Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
>> +static bool tmigr_inactive_up(struct tmigr_group *group,
>> + struct tmigr_group *child,
>> + void *ptr)
>> +{
>> + union tmigr_state curstate, newstate, childstate;
>> + struct tmigr_walk *data = ptr;
>> + bool walk_done;
>> + u8 childmask;
>> +
>> + childmask = data->childmask;
>> + curstate.state = atomic_read(&group->migr_state);
>> + childstate.state = 0;
>> +
>> + do {
>
> So I got the confirmation from Boqun (+Cc) and Paul that a failing cmpxchg
> may not order the load of the old value against subsequent loads. And
> that may apply to atomic_try_cmpxchg() as well.
>
> Therefore you not only need to turn group->migr_state read into
> an atomic_read_acquire() but you also need to do this on each iteration
> of this loop. For example you can move the read_acquire right here.

I tried to read and understand more about the memory barriers especially
the acquire/release stuff. So please correct me whenever I'm wrong.

We have to make sure that the child/group state values contain the last
updates and prevent reordering to be able to rely on those values.

So I understand that we need the atomic_read_acquire() here for the
child state, because we change the group state accordingly and need to
make sure that it contains the last update. The cmpxchg which
writes the child state is (on success) a full memory barrier. And the
atomic_read_acquire() makes sure all preceding "critical sections"
(each of which ends with a full memory barrier) are visible. Is this right?

To make sure the proper states are used, atomic_read_acquire() is then
also required in:
- tmigr_check_migrator()
- tmigr_check_migrator_and_lonely()
- tmigr_check_lonely()
- tmigr_new_timer_up() (for childstate and groupstate)
- tmigr_connect_child_parent()
Right?

Regarding the pairing of acquire: What happens when two
atomic_read_acquire() calls are executed one after the other, without
pairing 1:1 with a release or stronger memory barrier?

Now I want to understand the case for the group state here and also in
the active_up path. When reading it without acquire, it is possible that
not all changes are visible due to reordering and the like. But then the
worst outcome would be that the cmpxchg fails and the loop has to be
done once more? Is this right?

I know that memory barriers are not for free, and redoing the loop is
not for free either. But I don't know which of the two is worse. At
least in the inactive_up() path, we are not in the critical path. In
active_up() it would be good to take the less expensive option.

I want to understand the atomic_try_cmpxchg_acquire() variant: The read
is an acquire, so even if the compare/write fails, the value which is
handed back is the one which was last updated with a successful cmpxchg,
and then we can rely on this value?

Thanks a lot in advance for helping me understand this topic a little
better!

Anna-Maria

>
> Thanks.
>
>> + if (child)
>> + childstate.state = atomic_read(&child->migr_state);
>> +
>> + newstate = curstate;
>> + walk_done = true;
>> +
>> + /* Reset active bit when the child is no longer active */
>> + if (!childstate.active)
>> + newstate.active &= ~childmask;
>> +
>> + if (newstate.migrator == childmask) {
>> + /*
>> + * Find a new migrator for the group, because the child
>> + * group is idle!
>> + */
>> + if (!childstate.active) {
>> + unsigned long new_migr_bit, active = newstate.active;
>> +
>> + new_migr_bit = find_first_bit(&active, BIT_CNT);
>> +
>> + if (new_migr_bit != BIT_CNT) {
>> + newstate.migrator = BIT(new_migr_bit);
>> + } else {
>> + newstate.migrator = TMIGR_NONE;
>> +
>> + /* Changes need to be propagated */
>> + walk_done = false;
>> + }
>> + }
>> + }
>> +
>> + newstate.seq++;
>> +
>> + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
>> +
>> + } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));

2024-01-29 14:05:33

by Paul E. McKenney

Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

On Sat, Jan 27, 2024 at 11:54:46PM +0100, Frederic Weisbecker wrote:
> Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
> > +static bool tmigr_inactive_up(struct tmigr_group *group,
> > + struct tmigr_group *child,
> > + void *ptr)
> > +{
> > + union tmigr_state curstate, newstate, childstate;
> > + struct tmigr_walk *data = ptr;
> > + bool walk_done;
> > + u8 childmask;
> > +
> > + childmask = data->childmask;
> > + curstate.state = atomic_read(&group->migr_state);
> > + childstate.state = 0;
> > +
> > + do {
>
> So I got the confirmation from Boqun (+Cc) and Paul that a failing cmpxchg
> may not order the load of the old value against subsequent loads. And
> that may apply to atomic_try_cmpxchg() as well.

Plus we checked with Mark Rutland, who agreed, and who went further and
kindly submitted a patch to clarify the documentation:

https://lore.kernel.org/lkml/[email protected]/

Here is an LKMM litmus test demonstrating that failing cmpxchg() does not
provide ordering:

------------------------------------------------------------------------

C cmpxchg-fail-order

{}

P0(int *x, int *y, int *z)
{
	int r0;
	int r1;

	WRITE_ONCE(*x, 1);
	r1 = cmpxchg(z, 1, 0);
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y, int *z)
{
	int r0;
	int r1;

	WRITE_ONCE(*y, 1);
	r1 = cmpxchg(z, 1, 0);
	r0 = READ_ONCE(*x);
}

locations[0:r1;1:r1]
exists (0:r0=0 /\ 1:r0=0)

------------------------------------------------------------------------

Yes, this is cmpxchg() rather than atomic_cmpxchg(), but both have the
same ordering properties.

Here P0() is one thread and P1() the other. Their parameters are the
global shared variables. The "{}" preceding P0() is where initialization
could be placed, but this test is fine with the default initialization
of zero, for example, given that both instances of cmpxchg() specify
the value 1 as the old value (and thus both instances will fail).

The "locations" clause says to print the final values of P0()'s and P1()'s
local variables r1, where the leading "0:" selects P0() and the leading
"1:" selects P1(). And yes, the first process must be named "P0" and
subsequent processes' names must consist of "P" followed by a consecutive
number, so P0(), P1(), P2(), P3(), ...

The "exists" clause gives an expression to be tested "at the end of time".
Again, the "0:" and "1:" select P0() and P1(), respectively. The "="
tests for equality. The "/\" is boolean AND. Boolean OR would be
"\/" and boolean NOT "~". Parentheses may be used. A variable name on
the left-hand side of a relational operator denotes the value of that
variable, but on the right-hand side its address.

The value of any variable mentioned in the "exists" clause is printed as
part of the "States" list shown below.

Running this with the herd7 tool does the moral equivalent of a full
state-space search, and gets the following:

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg /tmp/cmpxchg.litmus
Test cmpxchg-fail-order Allowed
States 4
0:r0=0; 0:r1=0; 1:r0=0; 1:r1=0;
0:r0=0; 0:r1=0; 1:r0=1; 1:r1=0;
0:r0=1; 0:r1=0; 1:r0=0; 1:r1=0;
0:r0=1; 0:r1=0; 1:r0=1; 1:r1=0;
Ok
Witnesses
Positive: 1 Negative: 3
Condition exists (0:r0=0 /\ 1:r0=0)
Observation cmpxchg-fail-order Sometimes 1 3
Time cmpxchg-fail-order 0.00
Hash=564afea251867c6127350213c7eb388d

------------------------------------------------------------------------

Here, there are four possible combinations of final values for the two
r0 local variables, and all of them can happen, including the combination
where both are zero, which is the combination called out by the "exists"
clause. The "Sometimes" says that the expression in the "exists" clause
is sometimes true (other options being "Always" and "Never").

The fact that both instances of r0 can be zero shows that failing
cmpxchg() really does not provide ordering. And again, ordering for
atomic_try_cmpxchg() is the same as that for cmpxchg().

Thanx, Paul

> Therefore you not only need to turn group->migr_state read into
> an atomic_read_acquire() but you also need to do this on each iteration
> of this loop. For example you can move the read_acquire right here.
>
> Thanks.
>
> > + if (child)
> > + childstate.state = atomic_read(&child->migr_state);
> > +
> > + newstate = curstate;
> > + walk_done = true;
> > +
> > + /* Reset active bit when the child is no longer active */
> > + if (!childstate.active)
> > + newstate.active &= ~childmask;
> > +
> > + if (newstate.migrator == childmask) {
> > + /*
> > + * Find a new migrator for the group, because the child
> > + * group is idle!
> > + */
> > + if (!childstate.active) {
> > + unsigned long new_migr_bit, active = newstate.active;
> > +
> > + new_migr_bit = find_first_bit(&active, BIT_CNT);
> > +
> > + if (new_migr_bit != BIT_CNT) {
> > + newstate.migrator = BIT(new_migr_bit);
> > + } else {
> > + newstate.migrator = TMIGR_NONE;
> > +
> > + /* Changes need to be propagated */
> > + walk_done = false;
> > + }
> > + }
> > + }
> > +
> > + newstate.seq++;
> > +
> > + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> > +
> > + } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));

2024-01-29 22:22:06

by Frederic Weisbecker

Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Le Mon, Jan 29, 2024 at 11:50:39AM +0100, Anna-Maria Behnsen a écrit :
> Frederic Weisbecker <[email protected]> writes:
>
> > Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
> >> +static bool tmigr_inactive_up(struct tmigr_group *group,
> >> + struct tmigr_group *child,
> >> + void *ptr)
> >> +{
> >> + union tmigr_state curstate, newstate, childstate;
> >> + struct tmigr_walk *data = ptr;
> >> + bool walk_done;
> >> + u8 childmask;
> >> +
> >> + childmask = data->childmask;
> >> + curstate.state = atomic_read(&group->migr_state);
> >> + childstate.state = 0;
> >> +
> >> + do {
> >
> > So I got the confirmation from Boqun (+Cc) and Paul that a failing cmpxchg
> > may not order the load of the old value against subsequent loads. And
> > that may apply to atomic_try_cmpxchg() as well.
> >
> > Therefore you not only need to turn group->migr_state read into
> > an atomic_read_acquire() but you also need to do this on each iteration
> > of this loop. For example you can move the read_acquire right here.
>
> I tried to read and understand more about the memory barriers especially
> the acquire/release stuff. So please correct me whenever I'm wrong.
>
> We have to make sure that the child/group state values contain the last
> updates and prevent reordering to be able to rely on those values.
>
> So I understand that we need the atomic_read_acquire() here for the
> child state, because we change the group state accordingly and need to
> make sure that it contains the last update. The cmpxchg which
> writes the child state is (on success) a full memory barrier. And the
> atomic_read_acquire() makes sure all preceding "critical sections"
> (each of which ends with a full memory barrier) are visible. Is this right?

Right. And BTW, Paul suggested that I actually avoid
atomic_read_acquire() after a cmpxchg() failure, because that implies an
error prone re-read. So pick your favourite between smp_rmb() and
smp_mb__after_atomic().

With the latter this could look like:

curstate.state = atomic_read_acquire(&group->migr_state);
for (;;) {
	childstate.state = atomic_read(&child->migr_state);
	...
	if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state))
		break;
	smp_mb__after_atomic();
}

>
> To make sure the proper states are used, atomic_read_acquire() is then
> also required in:
> - tmigr_check_migrator()
> - tmigr_check_migrator_and_lonely()
> - tmigr_check_lonely()

Not sure about those. I'll check them.

> - tmigr_new_timer_up() (for childstate and groupstate)

Actually you need to fix some ordering there that I suggested a while ago :)
See https://lore.kernel.org/all/ZIhKT3h7Dc0G3xoU@lothringen/

> - tmigr_connect_child_parent()
> Right?
>
> Regarding the pairing of acquire: What happens when two
> atomic_read_acquire() calls are executed one after the other, without
> pairing 1:1 with a release or stronger memory barrier?

I think I'll need an example.

>
> Now I want to understand the case for the group state here and also in
> the active_up path. When reading it without acquire, it is possible that
> not all changes are visible due to reordering and the like. But then the
> worst outcome would be that the cmpxchg fails and the loop has to be
> done once more? Is this right?

Right. This one looks good as it doesn't depend on the child's value.

>
> I know that memory barriers are not for free, and redoing the loop is
> not for free either. But I don't know which of the two is worse. At
> least in the inactive_up() path, we are not in the critical path. In
> active_up() it would be good to take the less expensive option.

I don't think you need to change the active_up(), from a quick glance.

>
> I want to understand the atomic_try_cmpxchg_acquire() variant: The read
> is an acquire, so even if the compare/write fails, the value which is
> handed back is the one which was last updated with a successful cmpxchg,
> and then we can rely on this value?

So cmpxchg_acquire() provides a weaker ordering than cmpxchg(). Instead
of issuing a full memory barrier, it issues an acquire barrier, which is
really not what you want since you actually want to order what precedes
the cmpxchg() with the write that it performs. At the very least you would
actually need cmpxchg_release().

And most importantly, neither cmpxchg(), cmpxchg_release() nor cmpxchg_acquire()
guarantee any ordering on failure.

Thanks.

2024-01-30 13:33:15

by Anna-Maria Behnsen

Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Frederic Weisbecker <[email protected]> writes:

> Le Mon, Jan 29, 2024 at 11:50:39AM +0100, Anna-Maria Behnsen a écrit :
>> Frederic Weisbecker <[email protected]> writes:
>>
>> > Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
>> >> +static bool tmigr_inactive_up(struct tmigr_group *group,
>> >> + struct tmigr_group *child,
>> >> + void *ptr)
>> >> +{
>> >> + union tmigr_state curstate, newstate, childstate;
>> >> + struct tmigr_walk *data = ptr;
>> >> + bool walk_done;
>> >> + u8 childmask;
>> >> +
>> >> + childmask = data->childmask;
>> >> + curstate.state = atomic_read(&group->migr_state);
>> >> + childstate.state = 0;
>> >> +
>> >> + do {
>> >
>> > So I got the confirmation from Boqun (+Cc) and Paul that a failing cmpxchg
>> > may not order the load of the old value against subsequent loads. And
>> > that may apply to atomic_try_cmpxchg() as well.
>> >
>> > Therefore you not only need to turn group->migr_state read into
>> > an atomic_read_acquire() but you also need to do this on each iteration
>> > of this loop. For example you can move the read_acquire right here.
>>
>> I tried to read and understand more about the memory barriers especially
>> the acquire/release stuff. So please correct me whenever I'm wrong.
>>
>> We have to make sure that the child/group state values contain the last
>> updates and prevent reordering to be able to rely on those values.
>>
>> So I understand that we need the atomic_read_acquire() here for the
>> child state, because we change the group state accordingly and need to
>> make sure that it contains the last update. The cmpxchg which
>> writes the child state is (on success) a full memory barrier. And the
>> atomic_read_acquire() makes sure all preceding "critical sections"
>> (each of which ends with a full memory barrier) are visible. Is this right?
>
> Right. And BTW, Paul suggested that I actually avoid
> atomic_read_acquire() after a cmpxchg() failure, because that implies an
> error prone re-read. So pick your favourite between smp_rmb() and
> smp_mb__after_atomic().
>
> With the latter this could look like:
>
> curstate.state = atomic_read_acquire(&group->migr_state);
> for (;;) {
> 	childstate.state = atomic_read(&child->migr_state);
> 	...
> 	if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state))
> 		break;
> 	smp_mb__after_atomic();
> }

I'll take this.

>>
>> To make sure the proper states are used, atomic_read_acquire() is then
>> also required in:
>> - tmigr_check_migrator()
>> - tmigr_check_migrator_and_lonely()
>> - tmigr_check_lonely()
>
> Not sure about those. I'll check them.

Please ignore - I was on the wrong track, and John helped me (at least
another step) out of my "memory barrier understanding mess". So I have
no more questions regarding the memory barriers here - at least at the
moment.

>> - tmigr_new_timer_up() (for childstate and groupstate)
>
> Actually you need to fix some ordering there that I suggested a while ago :)
> See https://lore.kernel.org/all/ZIhKT3h7Dc0G3xoU@lothringen/

I quickly opened it and it seems that I completely missed it
somehow. I'm sorry!

Thanks for your patience!

Anna-Maria