2022-03-30 12:38:03

by Zqiang

[permalink] [raw]
Subject: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout

When the trc_wait waitqueue timeout, the atomic variable
trc_n_readers_need_end need to be checked again, perhaps the
conditions have been established at this time, avoid invalid
stall information output.

Signed-off-by: Zqiang <[email protected]>
---
kernel/rcu/tasks.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 65d6e21a607a..b73a2b362d6b 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
trc_wait,
atomic_read(&trc_n_readers_need_end) == 0,
READ_ONCE(rcu_task_stall_timeout));
- if (ret)
+ if (ret || !atomic_read(&trc_n_readers_need_end))
break; // Count reached zero.
// Stall warning time, so make a list of the offenders.
rcu_read_lock();
--
2.25.1


2022-03-31 02:52:22

by Zqiang

[permalink] [raw]
Subject: RE: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout


On Wed, Mar 30, 2022 at 07:20:14PM +0800, Zqiang wrote:
> When the trc_wait waitqueue timeout, the atomic variable
> trc_n_readers_need_end need to be checked again, perhaps the
> conditions have been established at this time, avoid invalid stall
> information output.
>
>But are you actually seeing this invalid stall information? Either way, please seem my comments and question below.
>
>Please don't get me wrong, we do have similar checks for normal vanilla RCU stall warnings, for example, this statement in print_other_cpu_stall() in kernel/rcu/tree_stall.h:
>
> pr_err("INFO: Stall ended before state dump start\n");
>
>However, the approach used there did benefit from significant real-world experience with its absence. ;-)
>
> Thanx, Paul
>
> Signed-off-by: Zqiang <[email protected]>
> ---
> kernel/rcu/tasks.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index
> 65d6e21a607a..b73a2b362d6b 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
> trc_wait,
> atomic_read(&trc_n_readers_need_end) == 0,
> READ_ONCE(rcu_task_stall_timeout));

>If I understand correctly, this patch is intended to handle a situation where the wait_event_idle_exclusive_timeout() timed out, but where the value of trc_n_readers_need_end decreased to zero just at this point.

Yes, this patch is intended to handle a situation.

>If so, please see my question below. If not, please show me the exact sequence of events leading up to the problem.

> - if (ret)
> + if (ret || !atomic_read(&trc_n_readers_need_end))
> break; // Count reached zero.

>Couldn't the value of trc_n_readers_need_end decrease to zero right here, still resulting in invalid stall information?

The value of trc_n_readers_need_end decrease to zero right here, indicates that the current grace period has been completed.
Even if we go to print some task with condition 't->trc_reader_special.b.need_qs' as true, and these task will not affect the current grace period. Is my understanding correct?

Thanks
Zqiang

> // Stall warning time, so make a list of the offenders.
> rcu_read_lock();
> --
> 2.25.1
>

2022-03-31 03:17:59

by Zqiang

[permalink] [raw]
Subject: RE: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout


On Wed, Mar 30, 2022 at 07:20:14PM +0800, Zqiang wrote:
> When the trc_wait waitqueue timeout, the atomic variable
> trc_n_readers_need_end need to be checked again, perhaps the
> conditions have been established at this time, avoid invalid stall
> information output.
>
>But are you actually seeing this invalid stall information? Either way, please seem my comments and question below.
>
>Please don't get me wrong, we do have similar checks for normal vanilla RCU stall warnings, for example, this statement in print_other_cpu_stall() in kernel/rcu/tree_stall.h:
>
> pr_err("INFO: Stall ended before state dump start\n");
>
>However, the approach used there did benefit from significant
>real-world experience with its absence. ;-)
>
> Thanx, Paul
>
> Signed-off-by: Zqiang <[email protected]>
> ---
> kernel/rcu/tasks.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index
> 65d6e21a607a..b73a2b362d6b 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
> trc_wait,
> atomic_read(&trc_n_readers_need_end) == 0,
> READ_ONCE(rcu_task_stall_timeout));

>If I understand correctly, this patch is intended to handle a situation where the wait_event_idle_exclusive_timeout() timed out, but where the value of trc_n_readers_need_end decreased to zero just at this point.
>>
>>Yes, this patch is intended to handle a situation.

>If so, please see my question below. If not, please show me the exact sequence of events leading up to the problem.

> - if (ret)
> + if (ret || !atomic_read(&trc_n_readers_need_end))
> break; // Count reached zero.

>Couldn't the value of trc_n_readers_need_end decrease to zero right here, still resulting in invalid stall information?
>
>>The value of trc_n_readers_need_end decrease to zero right here, indicates that the current grace period has been completed.
>>Even if we go to print some task with condition 't->trc_reader_special.b.need_qs' as true, and these task will not affect the current grace period. Is my understanding correct?
>>

Is my commit information not clear? Is this description more clearly

When the wait_event_idle_exclusive_timeout() timed out, if the value of
trc_n_readers_need_end decrease to zero just at this point
this indicates that the current grace period is just completed at this point,
direct break and avoid printing stall information.


>>Thanks
>>Zqiang

> // Stall warning time, so make a list of the offenders.
> rcu_read_lock();
> --
> 2.25.1
>

2022-03-31 04:43:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout

On Wed, Mar 30, 2022 at 07:20:14PM +0800, Zqiang wrote:
> When the trc_wait waitqueue timeout, the atomic variable
> trc_n_readers_need_end need to be checked again, perhaps the
> conditions have been established at this time, avoid invalid
> stall information output.

But are you actually seeing this invalid stall information? Either way,
please seem my comments and question below.

Please don't get me wrong, we do have similar checks for normal vanilla
RCU stall warnings, for example, this statement in print_other_cpu_stall()
in kernel/rcu/tree_stall.h:

pr_err("INFO: Stall ended before state dump start\n");

However, the approach used there did benefit from significant real-world
experience with its absence. ;-)

Thanx, Paul

> Signed-off-by: Zqiang <[email protected]>
> ---
> kernel/rcu/tasks.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 65d6e21a607a..b73a2b362d6b 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
> trc_wait,
> atomic_read(&trc_n_readers_need_end) == 0,
> READ_ONCE(rcu_task_stall_timeout));

If I understand correctly, this patch is intended to handle a situation
where the wait_event_idle_exclusive_timeout() timed out, but where the
value of trc_n_readers_need_end decreased to zero just at this point.
If so, please see my question below. If not, please show me the exact
sequence of events leading up to the problem.

> - if (ret)
> + if (ret || !atomic_read(&trc_n_readers_need_end))
> break; // Count reached zero.

Couldn't the value of trc_n_readers_need_end decrease to zero right
here, still resulting in invalid stall information?

> // Stall warning time, so make a list of the offenders.
> rcu_read_lock();
> --
> 2.25.1
>

2022-04-01 21:45:01

by Zqiang

[permalink] [raw]
Subject: RE: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout

On Thu, Mar 31, 2022 at 02:03:26AM +0000, Zhang, Qiang1 wrote:
>
> On Wed, Mar 30, 2022 at 07:20:14PM +0800, Zqiang wrote:
> > When the trc_wait waitqueue timeout, the atomic variable
> > trc_n_readers_need_end need to be checked again, perhaps the
> > conditions have been established at this time, avoid invalid stall
> > information output.
> >
> >But are you actually seeing this invalid stall information? Either way, please seem my comments and question below.
> >
> >Please don't get me wrong, we do have similar checks for normal vanilla RCU stall warnings, for example, this statement in print_other_cpu_stall() in kernel/rcu/tree_stall.h:
> >
> > pr_err("INFO: Stall ended before state dump start\n");
> >
> >However, the approach used there did benefit from significant
> >real-world experience with its absence. ;-)
> >
> > Thanx, Paul
> >
> > Signed-off-by: Zqiang <[email protected]>
> > ---
> > kernel/rcu/tasks.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index
> > 65d6e21a607a..b73a2b362d6b 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
> > trc_wait,
> > atomic_read(&trc_n_readers_need_end) == 0,
> > READ_ONCE(rcu_task_stall_timeout));
>
> >If I understand correctly, this patch is intended to handle a situation where the wait_event_idle_exclusive_timeout() timed out, but where the value of trc_n_readers_need_end decreased to zero just at this point.
> >>
> >>Yes, this patch is intended to handle a situation.
>
> >If so, please see my question below. If not, please show me the exact sequence of events leading up to the problem.
>
> > - if (ret)
> > + if (ret || !atomic_read(&trc_n_readers_need_end))
> > break; // Count reached zero.
>
> >Couldn't the value of trc_n_readers_need_end decrease to zero right here, still resulting in invalid stall information?
> >
> >>The value of trc_n_readers_need_end decrease to zero right here, indicates that the current grace period has been completed.
> >>Even if we go to print some task with condition 't->trc_reader_special.b.need_qs' as true, and these task will not affect the current grace period. Is my understanding correct?
> >>
>
> Is my commit information not clear? Is this description more clearly
>
> When the wait_event_idle_exclusive_timeout() timed out, if the value
> of trc_n_readers_need_end decrease to zero just at this point this
> indicates that the current grace period is just completed at this
> point, direct break and avoid printing stall information.

>Lack of clarity is not the problem. The problem is instead lack of justification. My questions were intended to help you understand how best to justify this sort of change, and also to help you determine on your own whether or not future changes can be justified.
>
>Please keep in mind that we reach that code when the grace period has been stalled for ten full -minutes-. If the system has 64 CPUs, this means that the interarrival rate of decrements to trc_n_readers_need_end is roughly 0.1 per second. The time between the wakeup-timeout test of trc_n_readers_need_end and the "if (ret)" will normally be small numbers of microseconds, let's say ten of them. The probability of this change making any difference is therefore on the order of one in a million.
>
>And the next step is to traverse the entire tasks list, which could take quite a few milliseconds. So there is is something like 1000 times more likely that the value of trc_n_readers_need_end will change in that loop than before the "if (ret)" check.
>
>Therefore, I am having a very hard time believing that this change is worthwhile.
>
>My turn. Is this explanation clear? ;-)

Thanks, this explanation was very detailed.

>
> Thanx, Paul

> >>Thanks
> >>Zqiang
>
> > // Stall warning time, so make a list of the offenders.
> > rcu_read_lock();
> > --
> > 2.25.1
> >

2022-04-02 13:43:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] rcu-tasks: Check the atomic variable trc_n_readers_need_end again when wait timeout

On Thu, Mar 31, 2022 at 02:03:26AM +0000, Zhang, Qiang1 wrote:
>
> On Wed, Mar 30, 2022 at 07:20:14PM +0800, Zqiang wrote:
> > When the trc_wait waitqueue timeout, the atomic variable
> > trc_n_readers_need_end need to be checked again, perhaps the
> > conditions have been established at this time, avoid invalid stall
> > information output.
> >
> >But are you actually seeing this invalid stall information? Either way, please seem my comments and question below.
> >
> >Please don't get me wrong, we do have similar checks for normal vanilla RCU stall warnings, for example, this statement in print_other_cpu_stall() in kernel/rcu/tree_stall.h:
> >
> > pr_err("INFO: Stall ended before state dump start\n");
> >
> >However, the approach used there did benefit from significant
> >real-world experience with its absence. ;-)
> >
> > Thanx, Paul
> >
> > Signed-off-by: Zqiang <[email protected]>
> > ---
> > kernel/rcu/tasks.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index
> > 65d6e21a607a..b73a2b362d6b 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -1544,7 +1544,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
> > trc_wait,
> > atomic_read(&trc_n_readers_need_end) == 0,
> > READ_ONCE(rcu_task_stall_timeout));
>
> >If I understand correctly, this patch is intended to handle a situation where the wait_event_idle_exclusive_timeout() timed out, but where the value of trc_n_readers_need_end decreased to zero just at this point.
> >>
> >>Yes, this patch is intended to handle a situation.
>
> >If so, please see my question below. If not, please show me the exact sequence of events leading up to the problem.
>
> > - if (ret)
> > + if (ret || !atomic_read(&trc_n_readers_need_end))
> > break; // Count reached zero.
>
> >Couldn't the value of trc_n_readers_need_end decrease to zero right here, still resulting in invalid stall information?
> >
> >>The value of trc_n_readers_need_end decrease to zero right here, indicates that the current grace period has been completed.
> >>Even if we go to print some task with condition 't->trc_reader_special.b.need_qs' as true, and these task will not affect the current grace period. Is my understanding correct?
> >>
>
> Is my commit information not clear? Is this description more clearly
>
> When the wait_event_idle_exclusive_timeout() timed out, if the value of
> trc_n_readers_need_end decrease to zero just at this point
> this indicates that the current grace period is just completed at this point,
> direct break and avoid printing stall information.

Lack of clarity is not the problem. The problem is instead lack of
justification. My questions were intended to help you understand how
best to justify this sort of change, and also to help you determine on
your own whether or not future changes can be justified.

Please keep in mind that we reach that code when the grace period has
been stalled for ten full -minutes-. If the system has 64 CPUs, this
means that the interarrival rate of decrements to trc_n_readers_need_end
is roughly 0.1 per second. The time between the wakeup-timeout test of
trc_n_readers_need_end and the "if (ret)" will normally be small numbers
of microseconds, let's say ten of them. The probability of this change
making any difference is therefore on the order of one in a million.

And the next step is to traverse the entire tasks list, which could take
quite a few milliseconds. So there is is something like 1000 times more
likely that the value of trc_n_readers_need_end will change in that loop
than before the "if (ret)" check.

Therefore, I am having a very hard time believing that this change
is worthwhile.

My turn. Is this explanation clear? ;-)

Thanx, Paul

> >>Thanks
> >>Zqiang
>
> > // Stall warning time, so make a list of the offenders.
> > rcu_read_lock();
> > --
> > 2.25.1
> >