2022-11-14 15:50:24

by Sven Schnelle

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

"Paul E. McKenney" <[email protected]> writes:

> On Fri, Nov 11, 2022 at 01:51:24PM +0100, Sven Schnelle wrote:
>> A CPU listed in the possible mask doesn't have to be present, in
>> which case it would crash the kernel in torture_online_all().
>> To prevent this use a for_each_present() loop.
>>
>> Signed-off-by: Sven Schnelle <[email protected]>
>
> Looks good to me! Any reason for no mailing list on CC?

No, my fault. I set up get_maintainer.pl to be called from git
send-email, but it looks like I did it wrong :-)

> Ah, and any synchronization required in case it is possible for a CPU
> to leave the cpu_present_mask? Or can they only be added?

Hmm... I think the main question is whether it is OK for a CPU to be
removed from the system while rcutorture is running. In both cases it
would disappear from the CPU online mask, so I don't think the patch
would change the behaviour. But I can check and send additional patches
if there are other places that need adjustment.

Regards
Sven
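
The difference between walking the possible mask and walking the present mask can be illustrated with a small user-space C sketch. Everything named sketch_* below is hypothetical and only models the masks as plain bit fields; the real kernel iterators are for_each_possible_cpu() and for_each_present_cpu(), and the real failure on s390 was a crash rather than an error return.

```c
/* Toy model: bit N set in a mask means CPU N is a member of that mask. */
#define SKETCH_NR_CPUS 8u

/*
 * Hypothetical stand-in for bringing a CPU online.
 * Returns 0 on success, -1 if the CPU is not physically present.
 */
static int sketch_cpu_up(unsigned int present, unsigned int cpu)
{
	if (!(present & (1u << cpu)))
		return -1;
	return 0;
}

/*
 * Try to online every CPU in the mask we walk and count the failures.
 * Walking the possible mask may hit CPUs that are not present;
 * walking the present mask cannot.
 */
static int sketch_online_all(unsigned int walk, unsigned int present)
{
	int failures = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(walk & (1u << cpu)))
			continue;
		if (sketch_cpu_up(present, cpu) != 0)
			failures++;
	}
	return failures;
}
```

With eight possible CPUs of which only the first four are present, sketch_online_all(0xFF, 0x0F) reports four failed attempts, while sketch_online_all(0x0F, 0x0F) reports none.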


2022-11-14 16:56:09

by Paul E. McKenney

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

On Mon, Nov 14, 2022 at 04:35:06PM +0100, Sven Schnelle wrote:
> "Paul E. McKenney" <[email protected]> writes:
>
> > On Fri, Nov 11, 2022 at 01:51:24PM +0100, Sven Schnelle wrote:
> >> A CPU listed in the possible mask doesn't have to be present, in
> >> which case it would crash the kernel in torture_online_all().
> >> To prevent this use a for_each_present() loop.
> >>
> >> Signed-off-by: Sven Schnelle <[email protected]>
> >
> > Looks good to me! Any reason for no mailing list on CC?
>
> No, my fault. I set up get_maintainer.pl to be called from git
> send-email, but it looks like I did it wrong :-)

Been there, done that! ;-)

> > Ah, and any synchronization required in case it is possible for a CPU
> > to leave the cpu_present_mask? Or can they only be added?
>
> Hmm... I think the main question is whether it is OK for a CPU to be
> removed from the system while rcutorture is running. In both cases it
> would disappear from the CPU online mask, so I don't think the patch
> would change the behaviour. But I can check and send additional patches
> if there are other places that need adjustment.

Yes, rcutorture has lower-level checks for CPUs being hotplugged
behind its back. Which might be sufficient. But this patch is in
response to something bad happening if the CPU is also not present in
the cpu_present_mask. Would that same bad thing happen if rcutorture saw
the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
it, that CPU was gone not just from cpu_online_mask, but also from
cpu_present_mask?

Or are CPUs never removed from cpu_present_mask?

Thanx, Paul

2022-11-15 07:12:36

by Sven Schnelle

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

Hi Paul,

"Paul E. McKenney" <[email protected]> writes:

> On Mon, Nov 14, 2022 at 04:35:06PM +0100, Sven Schnelle wrote:
>> "Paul E. McKenney" <[email protected]> writes:
>>
>> > On Fri, Nov 11, 2022 at 01:51:24PM +0100, Sven Schnelle wrote:
>> >> A CPU listed in the possible mask doesn't have to be present, in
>> >> which case it would crash the kernel in torture_online_all().
>> >> To prevent this use a for_each_present() loop.
>> >>
>> >> Signed-off-by: Sven Schnelle <[email protected]>
>> >
>> > Looks good to me! Any reason for no mailing list on CC?
>>
>> No, my fault. I set up get_maintainer.pl to be called from git
>> send-email, but it looks like I did it wrong :-)
>
> Been there, done that! ;-)
>
>> > Ah, and any synchronization required in case it is possible for a CPU
>> > to leave the cpu_present_mask? Or can they only be added?
>>
>> Hmm... I think the main question is whether it is OK for a CPU to be
>> removed from the system while rcutorture is running. In both cases it
>> would disappear from the CPU online mask, so I don't think the patch
>> would change the behaviour. But I can check and send additional patches
>> if there are other places that need adjustment.
>
> Yes, rcutorture has lower-level checks for CPUs being hotplugged
> behind its back. Which might be sufficient. But this patch is in
> response to something bad happening if the CPU is also not present in
> the cpu_present_mask. Would that same bad thing happen if rcutorture saw
> the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
> it, that CPU was gone not just from cpu_online_mask, but also from
> cpu_present_mask?
>
> Or are CPUs never removed from cpu_present_mask?

In the current implementation CPUs can only be added to the
cpu_present_mask, but never removed. This might change in the future
when we get support from firmware for that, but the current s390 code
doesn't do that.

Regards
Sven

2022-11-15 14:05:32

by Paul E. McKenney

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

On Tue, Nov 15, 2022 at 07:55:50AM +0100, Sven Schnelle wrote:
> Hi Paul,
>
> "Paul E. McKenney" <[email protected]> writes:
>
> > On Mon, Nov 14, 2022 at 04:35:06PM +0100, Sven Schnelle wrote:
> >> "Paul E. McKenney" <[email protected]> writes:
> >>
> >> > On Fri, Nov 11, 2022 at 01:51:24PM +0100, Sven Schnelle wrote:
> >> >> A CPU listed in the possible mask doesn't have to be present, in
> >> >> which case it would crash the kernel in torture_online_all().
> >> >> To prevent this use a for_each_present() loop.
> >> >>
> >> >> Signed-off-by: Sven Schnelle <[email protected]>
> >> >
> >> > Looks good to me! Any reason for no mailing list on CC?
> >>
> >> No, my fault. I set up get_maintainer.pl to be called from git
> >> send-email, but it looks like I did it wrong :-)
> >
> > Been there, done that! ;-)
> >
> >> > Ah, and any synchronization required in case it is possible for a CPU
> >> > to leave the cpu_present_mask? Or can they only be added?
> >>
> >> Hmm... I think the main question is whether it is OK for a CPU to be
> >> removed from the system while rcutorture is running. In both cases it
> >> would disappear from the CPU online mask, so I don't think the patch
> >> would change the behaviour. But I can check and send additional patches
> >> if there are other places that need adjustment.
> >
> > Yes, rcutorture has lower-level checks for CPUs being hotplugged
> > behind its back. Which might be sufficient. But this patch is in
> > response to something bad happening if the CPU is also not present in
> > the cpu_present_mask. Would that same bad thing happen if rcutorture saw
> > the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
> > it, that CPU was gone not just from cpu_online_mask, but also from
> > cpu_present_mask?
> >
> > Or are CPUs never removed from cpu_present_mask?
>
> In the current implementation CPUs can only be added to the
> cpu_present_mask, but never removed. This might change in the future
> when we get support from firmware for that, but the current s390 code
> doesn't do that.

Very good!

Then could the patch please check that bits are never removed?
That way the code will complain should firmware support be added.

Thanx, Paul

2022-11-17 06:36:03

by Sven Schnelle

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

Hi Paul,

"Paul E. McKenney" <[email protected]> writes:

>> > Yes, rcutorture has lower-level checks for CPUs being hotplugged
>> > behind its back. Which might be sufficient. But this patch is in
>> > response to something bad happening if the CPU is also not present in
>> > the cpu_present_mask. Would that same bad thing happen if rcutorture saw
>> > the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
>> > it, that CPU was gone not just from cpu_online_mask, but also from
>> > cpu_present_mask?
>> >
>> > Or are CPUs never removed from cpu_present_mask?
>>
>> In the current implementation CPUs can only be added to the
>> cpu_present_mask, but never removed. This might change in the future
>> when we get support from firmware for that, but the current s390 code
>> doesn't do that.
>
> Very good!
>
> Then could the patch please check that bits are never removed?
> That way the code will complain should firmware support be added.
>
> Thanx, Paul

I'm not sure whether I fully understand that. If the CPU could
be removed from the system and the cpu_present_mask, that could
happen at any time. So I don't see how we should check for that?

Regards
Sven

2022-11-17 15:23:41

by Paul E. McKenney

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

On Thu, Nov 17, 2022 at 07:30:32AM +0100, Sven Schnelle wrote:
> Hi Paul,
>
> "Paul E. McKenney" <[email protected]> writes:
>
> >> > Yes, rcutorture has lower-level checks for CPUs being hotplugged
> >> > behind its back. Which might be sufficient. But this patch is in
> >> > response to something bad happening if the CPU is also not present in
> >> > the cpu_present_mask. Would that same bad thing happen if rcutorture saw
> >> > the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
> >> > it, that CPU was gone not just from cpu_online_mask, but also from
> >> > cpu_present_mask?
> >> >
> >> > Or are CPUs never removed from cpu_present_mask?
> >>
> >> In the current implementation CPUs can only be added to the
> >> cpu_present_mask, but never removed. This might change in the future
> >> when we get support from firmware for that, but the current s390 code
> >> doesn't do that.
> >
> > Very good!
> >
> > Then could the patch please check that bits are never removed?
> > That way the code will complain should firmware support be added.
> >
> > Thanx, Paul
>
> I'm not sure whether I fully understand that. If the CPU could
> be removed from the system and the cpu_present_mask, that could
> happen at any time. So I don't see how we should check for that?

Well, that is my question to you. ;-)

Suppose we have the following sequence of events:

o rcutorture sees that CPU 5 is in cpu_present_mask, but offline.

o rcutorture therefore decides to online CPU 5.

o s390 firmware removes CPU 5, and s390 architecture code then
  clears it from the cpu_present_mask.

o rcutorture proceeds with onlining CPU 5.

Don't we then get the same problem that prompted you to change from
cpu_possible_mask to cpu_present_mask? If not, why can't the rcutorture
code continue to use cpu_possible_mask?

If it really is bad to try to online or offline a CPU that is in
cpu_possible_mask but not in cpu_present_mask, and if CPUs can be removed
from cpu_present_mask, then we need some way to synchronize the removal
of CPUs from cpu_present_mask. There are of course a lot of possible
ways to do that synchronization, for example, protecting cpu_present_mask
with a mutex or similar.

Alternatively, s390 could restrict things. One way to do that would
be to turn off rcutorture's use of CPU hotplug when running on s390,
for example, by using the module parameters provided for that purpose.
Another way to do that would be to refrain from removing CPUs from
cpu_present_mask while rcutorture is running.

Are there other approaches?

Thanx, Paul
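
The "protect cpu_present_mask with a mutex or similar" option can be sketched in user space as follows. This is only an illustration of the check-then-act race in the CPU 5 sequence above, not a proposal for the actual kernel locking: all sketch_* names are hypothetical, the masks are plain bit fields, and a pthread mutex stands in for whatever synchronization the kernel would really use.

```c
#include <pthread.h>

/* One lock covers both the present-mask check and the action taken on it. */
static pthread_mutex_t sketch_present_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int sketch_present = 0x3Fu; /* CPUs 0-5 are present */
static unsigned int sketch_online  = 0x1Fu; /* CPUs 0-4 online, CPU 5 offline */

/*
 * Hypothetical "firmware removed a CPU" path.
 * Only an offline CPU may leave the present mask.
 * Returns 0 on success, -1 if the CPU is still online.
 */
static int sketch_remove_cpu(unsigned int cpu)
{
	int ret = -1;

	pthread_mutex_lock(&sketch_present_lock);
	if (!(sketch_online & (1u << cpu))) {
		sketch_present &= ~(1u << cpu);
		ret = 0;
	}
	pthread_mutex_unlock(&sketch_present_lock);
	return ret;
}

/*
 * Hypothetical online path as a torture test would use it.
 * The presence check and the onlining happen under the same lock, so a
 * removal cannot slip in between "CPU is present" and "online it".
 * Returns 0 on success, -1 if the CPU is not present.
 */
static int sketch_online_cpu(unsigned int cpu)
{
	int ret = -1;

	pthread_mutex_lock(&sketch_present_lock);
	if (sketch_present & (1u << cpu)) {
		sketch_online |= 1u << cpu;
		ret = 0;
	}
	pthread_mutex_unlock(&sketch_present_lock);
	return ret;
}
```

Replaying the sequence above against this model, a removal of CPU 5 that wins the race makes the later sketch_online_cpu(5) fail cleanly with -1 instead of acting on a stale view of the mask.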

2022-11-18 23:57:31

by Paul E. McKenney

Subject: Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

On Thu, Nov 17, 2022 at 07:06:37AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2022 at 07:30:32AM +0100, Sven Schnelle wrote:
> > Hi Paul,
> >
> > "Paul E. McKenney" <[email protected]> writes:
> >
> > >> > Yes, rcutorture has lower-level checks for CPUs being hotplugged
> > >> > behind its back. Which might be sufficient. But this patch is in
> > >> > response to something bad happening if the CPU is also not present in
> > >> > the cpu_present_mask. Would that same bad thing happen if rcutorture saw
> > >> > the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
> > >> > it, that CPU was gone not just from cpu_online_mask, but also from
> > >> > cpu_present_mask?
> > >> >
> > >> > Or are CPUs never removed from cpu_present_mask?
> > >>
> > >> In the current implementation CPUs can only be added to the
> > >> cpu_present_mask, but never removed. This might change in the future
> > >> when we get support from firmware for that, but the current s390 code
> > >> doesn't do that.
> > >
> > > Very good!
> > >
> > > Then could the patch please check that bits are never removed?
> > > That way the code will complain should firmware support be added.
> > >
> > > Thanx, Paul
> >
> > I'm not sure whether I fully understand that. If the CPU could
> > be removed from the system and the cpu_present_mask, that could
> > happen at any time. So I don't see how we should check for that?
>
> Well, that is my question to you. ;-)
>
> Suppose we have the following sequence of events:
>
> o rcutorture sees that CPU 5 is in cpu_present_mask, but offline.
>
> o rcutorture therefore decides to online CPU 5.
>
> o s390 firmware removes CPU 5, and s390 architecture code then
>   clears it from the cpu_present_mask.
>
> o rcutorture proceeds with onlining CPU 5.
>
> Don't we then get the same problem that prompted you to change from
> cpu_possible_mask to cpu_present_mask? If not, why can't the rcutorture
> code continue to use cpu_possible_mask?
>
> If it really is bad to try to online or offline a CPU that is in
> cpu_possible_mask but not in cpu_present_mask, and if CPUs can be removed
> from cpu_present_mask, then we need some way to synchronize the removal
> of CPUs from cpu_present_mask. There are of course a lot of possible
> ways to do that synchronization, for example, protecting cpu_present_mask
> with a mutex or similar.
>
> Alternatively, s390 could restrict things. One way to do that would
> be to turn off rcutorture's use of CPU hotplug when running on s390,
> for example, by using the module parameters provided for that purpose.
> Another way to do that would be to refrain from removing CPUs from
> cpu_present_mask while rcutorture is running.
>
> Are there other approaches?

For the near term, why not have rcutorture keep a snapshot of
cpu_present_mask, and splat if a CPU is ever removed from that mask?

That would catch any issues, and defer any synchronization decisions to
a time at which we actually have some chance of knowing what is going on.

Thanx, Paul
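
The snapshot idea boils down to one bit operation: any bit set in the old snapshot but clear in the current mask is a CPU that has been removed. A minimal user-space sketch, assuming the mask fits in one unsigned long and using hypothetical sketch_* names, might look like this. In the kernel the nonzero result would presumably feed something like WARN_ON_ONCE() to produce the splat, and the masks would be struct cpumask rather than a single word.

```c
/*
 * Return the CPUs that were present in the snapshot but are missing from
 * the current present mask. Zero means nothing was removed.
 */
static unsigned long sketch_removed_cpus(unsigned long snapshot,
					 unsigned long now)
{
	return snapshot & ~now;
}

/*
 * Compare the current present mask against the stored snapshot, then
 * refresh the snapshot so that newly added CPUs are tracked as well.
 * Returns 1 if at least one CPU disappeared (where the kernel would
 * splat), 0 otherwise.
 */
static int sketch_check_present(unsigned long *snapshot, unsigned long now)
{
	unsigned long gone = sketch_removed_cpus(*snapshot, now);

	*snapshot = now;
	return gone != 0;
}
```

Adding a CPU (for example, going from 0x0F to 0x1F) leaves the check quiet, while a later mask of 0x17 flags CPU 3 as removed. This only detects removal after the fact; it deliberately does not try to synchronize with it.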