Subject: RE: [PATCH v4 4/4] rcu: Add RCU stall diagnosis information


> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -95,6 +95,16 @@ config RCU_EXP_CPU_STALL_TIMEOUT
> says to use the RCU_CPU_STALL_TIMEOUT value converted from
> seconds to milliseconds.
>
> +config RCU_CPU_STALL_CPUTIME

Since the information might change and grow over time, consider
calling it "ADDITIONAL_INFO" rather than just "CPUTIME".

> + bool "Provide additional rcu stall debug information"
> + depends on RCU_STALL_COMMON
> + default n
> + help
> + Statistics during the period from RCU_CPU_STALL_TIMEOUT/2 to
> + RCU_CPU_STALL_TIMEOUT, such as the number of (hard interrupts,
> soft
> + interrupts, task switches) and the cputime of (hard interrupts,
> soft
> + interrupts, kerenl tasks) are added to the rcu stall report.

s/kerenl/kernel/

Those parenthetical words are not grammatically correct, and
nobody will remember to update Kconfig if they change the code.

Try making it is a little less precise:
"Collect statistics during the RCU stall timeout period
(e.g., counts and CPU time consumed by hard interrupts, soft
interrupts, task switches, and kernel tasks) and include
them in the RCU stall console error message."

...
> +
> + pr_err(" hardirqs softirqs csw/system\n");
> + pr_err(" number: %8ld %10d %12lld\n",

Everything after a %d is a "number". These are more specifically counts.

> + kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
> + kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
> + nr_context_switches_cpu(cpu) - rsrp->nr_csw);

Two columns are plural and one is not.

Within a few lines, this is referred to as "context_switches", cws,
and "task switches." One term would be better. vmstat uses "cs",
pidstat uses "cswch", and /proc/<pid>/status uses "ctxt_switches,"
so there's not one ideal term for that column. Matching one
of those might be better than using another unique name.

Since no other variants are shown, the "/system" suffix is a bit
confusing.


> + pr_err("cputime: %8lld %10lld %12lld ==> %lld(ms)\n",
> + div_u64(cpustat[CPUTIME_IRQ] - rsrp->cputime_irq, NSEC_PER_MSEC),
> + div_u64(cpustat[CPUTIME_SOFTIRQ] - rsrp->cputime_softirq, NSEC_PER_MSEC),
> + div_u64(cpustat[CPUTIME_SYSTEM] - rsrp->cputime_system, NSEC_PER_MSEC),
> + jiffies64_to_msecs(half_timeout));

There is no column header for that fourth number. "half_timeout"
might be good, dropping the "==>".

The "(ms)" at the end is awkward. Try moving it to the left
as "cputime (ms): "

> +int rcu_cpu_stall_cputime __read_mostly =
> IS_ENABLED(CONFIG_RCU_CPU_STALL_CPUTIME);

As a config option and module parameter, adding some more
instrumentation overhead might be worthwhile for other
likely causes of rcu stalls.

For example, if enabled, have these functions (if available
on the architecture) maintain a per-CPU running count of
their invocations, which also cause the CPU to be unavailable
for rcu:
- kernel_fpu_begin() calls - FPU/SIMD context preservation,
which also calls preempt_disable()
- preempt_disable() calls - scheduler context switches disabled
- local_irq_save() calls - interrupts disabled
- cond_resched() calls - lack of these is a problem

For kernel_fpu_begin and preempt_disable, knowing if it is
currently blocked for those reasons is probably the most
helpful.



2022-11-05 20:33:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] rcu: Add RCU stall diagnosis information

On Sat, Nov 05, 2022 at 01:58:28AM +0000, Elliott, Robert (Servers) wrote:
>
> > --- a/kernel/rcu/Kconfig.debug
> > +++ b/kernel/rcu/Kconfig.debug
> > @@ -95,6 +95,16 @@ config RCU_EXP_CPU_STALL_TIMEOUT
> > says to use the RCU_CPU_STALL_TIMEOUT value converted from
> > seconds to milliseconds.
> >
> > +config RCU_CPU_STALL_CPUTIME
>
> Since the information might change and grow over time, consider
> calling it "ADDITIONAL_INFO" rather than just "CPUTIME".

Except that I already redirected Zhen Lei from a generic description
to a more specific one. The reason for this is that I already get
complaints about the large volume of output from RCU CPU stall warnings,
which suggests that unless the additional information is quite small,
it should get its own Kconfig option and kernel boot parameter.

So we well be keeping RCU_CPU_STALL_CPUTIME.

And yes, I understand that high-end enterprise deployments might have
absolutely no problem with the additional volume, but Linux handles a
range of hardware and workloads that I would have thought impossible
only 20 years ago. And RCU needs to accommodate the full range.

> > + bool "Provide additional rcu stall debug information"
> > + depends on RCU_STALL_COMMON
> > + default n
> > + help
> > + Statistics during the period from RCU_CPU_STALL_TIMEOUT/2 to
> > + RCU_CPU_STALL_TIMEOUT, such as the number of (hard interrupts,
> > soft
> > + interrupts, task switches) and the cputime of (hard interrupts,
> > soft
> > + interrupts, kerenl tasks) are added to the rcu stall report.
>
> s/kerenl/kernel/
>
> Those parenthetical words are not grammatically correct, and
> nobody will remember to update Kconfig if they change the code.
>
> Try making it is a little less precise:
> "Collect statistics during the RCU stall timeout period
> (e.g., counts and CPU time consumed by hard interrupts, soft
> interrupts, task switches, and kernel tasks) and include
> them in the RCU stall console error message."

My usual practice when taking in patches from someone whose native
language differs greatly from English is to do a round of wordsmithing.
I run the wordsmithing past the author, which helps catch any errors
I might make translating the rest of the way to Engligh and also
(more frequently than not) helps the author's English skills.

I have taken care of all but part of Zhen Lei's lat patch, and will
pull that in later today.

> ...
> > +
> > + pr_err(" hardirqs softirqs csw/system\n");
> > + pr_err(" number: %8ld %10d %12lld\n",
>
> Everything after a %d is a "number". These are more specifically counts.

The distinction between "number" and "count" probably isn't going to
travel well, especially given that the next line could be argued to be
either a number or a count of milliseconds.

To see this, please use Google Translate to convert "George is just along
for the ride" to Chinese (simplified) and back and see what you get. ;-)

OK, here is what I am going to do. I am going to translate the last
patch's commit log, but leave the patch itself as Zhen Lei sent it.
Please feel free to send a follow-on patch that does the wordsmithing,
but please check your work with some non-native English speakers.
I am guessing that Zhen Lei would be happy to give you feedback from a
both a technical viewpoint and a Chinese-speaker viewpoint.

Please understand that this offer is not in any way sarcastic or
insincere. This sort of translation does take time and energy, and so
I would have to be quite stupid not to accept good help.

> > + kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
> > + kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
> > + nr_context_switches_cpu(cpu) - rsrp->nr_csw);
>
> Two columns are plural and one is not.
>
> Within a few lines, this is referred to as "context_switches", cws,
> and "task switches." One term would be better. vmstat uses "cs",
> pidstat uses "cswch", and /proc/<pid>/status uses "ctxt_switches,"
> so there's not one ideal term for that column. Matching one
> of those might be better than using another unique name.
>
> Since no other variants are shown, the "/system" suffix is a bit
> confusing.

These are good points, including the lack of standard current
nomenclature. Again, would you be interested in providing a translation
patch?

> > + pr_err("cputime: %8lld %10lld %12lld ==> %lld(ms)\n",
> > + div_u64(cpustat[CPUTIME_IRQ] - rsrp->cputime_irq, NSEC_PER_MSEC),
> > + div_u64(cpustat[CPUTIME_SOFTIRQ] - rsrp->cputime_softirq, NSEC_PER_MSEC),
> > + div_u64(cpustat[CPUTIME_SYSTEM] - rsrp->cputime_system, NSEC_PER_MSEC),
> > + jiffies64_to_msecs(half_timeout));
>
> There is no column header for that fourth number. "half_timeout"
> might be good, dropping the "==>".

Another approach is to print it as a fraction, in this case, 2425/2500.
Of course, adding the "/2500" to the other two columns would be annoying.
Yet another approach is to leave that last column, but prepend a "/",
as in "/2500(ms)".

> The "(ms)" at the end is awkward. Try moving it to the left
> as "cputime (ms): "

Good point, and that might also better disambiguate "number" than
any other choice.

> > +int rcu_cpu_stall_cputime __read_mostly =
> > IS_ENABLED(CONFIG_RCU_CPU_STALL_CPUTIME);
>
> As a config option and module parameter, adding some more
> instrumentation overhead might be worthwhile for other
> likely causes of rcu stalls.
>
> For example, if enabled, have these functions (if available
> on the architecture) maintain a per-CPU running count of
> their invocations, which also cause the CPU to be unavailable
> for rcu:
> - kernel_fpu_begin() calls - FPU/SIMD context preservation,
> which also calls preempt_disable()
> - preempt_disable() calls - scheduler context switches disabled
> - local_irq_save() calls - interrupts disabled
> - cond_resched() calls - lack of these is a problem
>
> For kernel_fpu_begin and preempt_disable, knowing if it is
> currently blocked for those reasons is probably the most
> helpful.

These could be additional patches.

But please tie any such patch to an actual use case. After all, if no
one actually uses that additional information, we have irritates untold
numbers of electrons for no purpose.

Also, some of those functions are on fastpaths, so adding unconditional
instrumentation overhead might result in an objection or three.

As always, choose wisely! ;-)

If I don't see anything from you by this coming Friday, I will fold my
usual wordsmithing into the patch.

Thanx, Paul

Subject: RE: [PATCH v4 4/4] rcu: Add RCU stall diagnosis information


> > > +config RCU_CPU_STALL_CPUTIME
> >
> > Since the information might change and grow over time, consider
> > calling it "ADDITIONAL_INFO" rather than just "CPUTIME".
>
> Except that I already redirected Zhen Lei from a generic description
> to a more specific one. The reason for this is that I already get
> complaints about the large volume of output from RCU CPU stall warnings,
> which suggests that unless the additional information is quite small,
> it should get its own Kconfig option and kernel boot parameter.
>
> So we well be keeping RCU_CPU_STALL_CPUTIME.

No problem - CPUTIME is easier to search for, and will likely
always remain as one of the values reported.

> But please tie any such patch to an actual use case. After all, if no
> one actually uses that additional information, we have irritates untold
> numbers of electrons for no purpose.

I've been working on fixing intermittent RCU stalls that turned out to be
caused by x86-optimized crypto modules holding kernel_fpu_begin()/end()
too long and crypto tests not calling cond_resched() often enough, so am
interested in features that will help identify and prevent such
problems.

> Also, some of those functions are on fastpaths, so adding unconditional
> instrumentation overhead might result in an objection or three.
>
> As always, choose wisely! ;-)
>
> If I don't see anything from you by this coming Friday, I will fold my
> usual wordsmithing into the patch.

I applied the series to a tree not including any crypto module
fixes, but an overnight run didn't trigger any RCU stalls. So, I
modified the tcrypt test module to simulate a problem by
running a lengthy loop after kernel_fpu_begin().
Here's an example of how the output looks:


[ 1816.698380] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 1816.704368] rcu: 0-....: (2999 ticks this GP) idle=1dbc/1/0x4000000000000000 softirq=39683/39683 fqs=751
[ 1816.714200] rcu: hardirqs softirqs csw/system
[ 1816.719922] rcu: number: 5 10 0
[ 1816.725643] rcu: cputime: 3 0 1492 ==> 1500(ms)
[ 1816.732669] (t=3030 jiffies g=89765 q=308 ncpus=56)
[ 1816.737857] CPU: 0 PID: 46826 Comm: modprobe Tainted: G W 6.0.0+ #5
[ 1816.745754] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/08/2022
[ 1816.754523] RIP: 0010:rude_sleep_cycles.constprop.0+0x1c/0x30 [tcrypt]
[ 1816.761290] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 0f 31 48 c1 e2 20 be ab 90 41 00 48 89 d1 48 c1 e6 0b 48 09 c1 0f 31 <48> c1 e2 20 48 09 c2 48 29 ca 48 39 d6 73 ef c3 cc cc cc cc 48 8b
...

That makes me realize what "csw/switch" means:
- "csw" context switches applies to the number line
- "system" applies the cputime line

Maybe this shouldn't be a table? Make it grep-friendly:
[ 1816.719922] rcu: half-timeout counts: hardirq =5 softirq=10 csw=0
[ 1816.725643] rcu: half_timeout cputimes (ms): time=1500 hardirq=3 softirq=0 system=1492

The new prints interfere with existing prints, pushing the "t=3030" line
further away from its context. That's from a pr_cont. Existing prints after
the "self-detect stall" message all start with \t, so they are roughly
related to the first line. The new prints should probably do the same.

Since the first line ends with \n, the pr_cont will never make it on the
same line, so it might be cleaner to use pr_err instead. That way it'll have
the "rcu:" prefix of the other lines.

That's from:
pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name);
raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags);
print_cpu_stall_info(smp_processor_id());
[that's where this patch adds more prints]
raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags);
for_each_possible_cpu(cpu)
totqlen += rcu_get_n_cbs_cpu(cpu);
pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n",
jiffies - gps,
(long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);


One other print has similar construction:
pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name);
...
print_cpu_stall_info(cpu);
...
pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n",
smp_processor_id(), (long)(jiffies - gps),
(long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);