by Paul E. McKenney

Subject: Re: [PATCH V4 2/2] rcu: Update jiffies in rcu_cpu_stall_reset()

On Sun, Aug 27, 2023 at 06:11:40PM -0400, Joel Fernandes wrote:
> On Sun, Aug 27, 2023 at 1:51 AM Huacai Chen <[email protected]> wrote:
> [..]
> > > > > > The only way I know of to avoid these sorts of false positives is for
> > > > > > the user to manually suppress all timeouts (perhaps using a kernel-boot
> > > > > > parameter for your early-boot case), do the gdb work, and then unsuppress
> > > > > > all stalls. Even that won't work for networking, because the other
> > > > > > system's clock will be running throughout.
> > > > > >
> > > > > > In other words, from what I know now, there is no perfect solution.
> > > > > > Therefore, there are sharp limits to the complexity of any solution that
> > > > > > I will be willing to accept.
> > > > > > I think the simplest solution is (I hope Joel will not be angry):
> > > >
> > > > Not angry at all, just want to help. ;-) The problem is the 300*HZ solution
> > > > will also affect the VM workloads which also do a similar reset. Allow me a few
> > > > days to see if I can take a shot at fixing it slightly differently. I am
> > > > trying Paul's idea of setting jiffies at a later time. I think it is doable.
> > > > I think the advantage of doing this is it will make stall detection more
> > > > robust in the face of these gaps in jiffies updates. And that solution does
> > > > not even need us to rely on ktime (and all the issues that come with that).
> > > >
> > >
> > > I wrote a patch similar to Paul's idea and sent it out for review, the
> > > advantage being that it is based purely on jiffies. Could you try it out
> > > and let me know?
> > If you can cc my gmail <[email protected]>, that could be better.
>
> Sure, will do.
>
> > I have read your patch, maybe the counter (nr_fqs_jiffies_stall)
> > should be atomic_t and we should use an atomic operation to decrement its
> > value. Because rcu_gp_fqs() can be run concurrently, and we may miss
> > the (nr_fqs == 1) condition.
>
> I don't think so. There is only one place where an RMW operation happens
> and rcu_gp_fqs() is called only from the GP kthread. So a concurrent
> RMW (and hence a lost update) is not possible.

Huacai, is your concern that the gdb user might have created a script
(for example, printing a variable or two, then automatically continuing),
so that breakpoints could happen in quick succession, such that the
second breakpoint might run concurrently with rcu_gp_fqs()?

If this can really happen, the point that Joel makes is a good one, namely
that rcu_gp_fqs() is single-threaded and (absent rcutorture) runs only
once every few jiffies. And gdb breakpoints, even with scripting, should
also be rather rare. So if this is an issue, a global lock should do the
trick, perhaps even one of the existing locks in the rcu_state structure.
The result should then be just as performant/scalable and a lot simpler
than use of atomics.
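
For concreteness, here is a rough userspace sketch of the lock-based approach.
It is not taken from Joel's patch: struct stall_state, stall_reset(),
stall_fqs_tick(), and the pthread mutex are hypothetical stand-ins for the
rcu_state fields and whichever existing lock would be chosen, and the
countdown value is up to the caller. The only point is that both the reset
path and the decrement run under one lock, so a reset racing with the
countdown cannot lose an update.

```c
#include <limits.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant rcu_state fields. */
struct stall_state {
	pthread_mutex_t lock;        /* stand-in for an existing rcu_state lock */
	unsigned long jiffies_stall; /* time at which a stall would be reported */
	int nr_fqs_jiffies_stall;    /* FQS passes left before re-arming */
};

static struct stall_state st = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Reset path (for example after a debugger stop): push the deadline far
 * out and arm the countdown. */
static void stall_reset(int nr_fqs)
{
	pthread_mutex_lock(&st.lock);
	st.jiffies_stall = ULONG_MAX;
	st.nr_fqs_jiffies_stall = nr_fqs;
	pthread_mutex_unlock(&st.lock);
}

/* FQS path: count down and, on the pass that reaches zero, re-arm the
 * deadline from the current time. Returns true only on that pass. */
static bool stall_fqs_tick(unsigned long now, unsigned long timeout)
{
	bool rearmed = false;

	pthread_mutex_lock(&st.lock);
	if (st.nr_fqs_jiffies_stall > 0 && --st.nr_fqs_jiffies_stall == 0) {
		st.jiffies_stall = now + timeout;
		rearmed = true;
	}
	pthread_mutex_unlock(&st.lock);
	return rearmed;
}
```

If a second reset arrives partway through the countdown, it simply restarts
the count under the same lock, which is the behavior one would want for
back-to-back breakpoints.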

> Could you test the patch for the issue you are seeing and provide your
> Tested-by tag? Thanks,

Either way, testing would of course be very good! ;-)

Thanx, Paul

2023-08-30 18:33:20

by Huacai Chen

Subject: Re: [PATCH V4 2/2] rcu: Update jiffies in rcu_cpu_stall_reset()

On Tue, Aug 29, 2023 at 10:46 PM Joel Fernandes <[email protected]> wrote:
>
> On Tue, Aug 29, 2023 at 12:08 AM Huacai Chen <[email protected]> wrote:
> >
> > Hi, Joel,
> >
> > On Tue, Aug 29, 2023 at 4:47 AM Joel Fernandes <[email protected]> wrote:
> > >
> > > Hi Huacai,
> > >
> > > On Mon, Aug 28, 2023 at 11:13 AM Huacai Chen <[email protected]> wrote:
> > > >
> > > [...]
> > > > >
> > > > > > [Huacai]
> > > > > > I also think the original patch should be OK, but I have another
> > > > > > question: what will happen if the current GP ends before
> > > > > > nr_fqs_jiffies_stall reaches zero?
> > > > >
> > > > > Nothing should happen. Stall detection only happens when a GP is in
> > > > > progress. If a new GP starts, it resets nr_fqs_jiffies_stall.
> > > > >
> > > > > Or can you elaborate your concern more?
> > > > > OK, I will test your patch in the next few days. Maybe putting
> > > > > nr_fqs_jiffies_stall before jiffies_force_qs is better, because I
> > > > > think putting an 'int' between two 'long' fields wastes space. :)
> > >
> > > That's a good point and I'll look into that.
> > Another point, is it better to replace ULONG_MAX with ULONG_MAX/4 as
> > Paul suggested?
> >
>
> I could do that but I don't feel too strongly about it. I will keep it
> at ULONG_MAX if it's OK with everyone.
>
> > > Meanwhile I pushed the patch out to my 6.4 stable tree for testing on my fleet.
> > >
> > > Ideally, I'd like to change the stall detection test in the rcutorture
> > > to actually fail rcutorture if stalls don't happen in time. But at
> > > least I verified this manually using rcutorture.
> > >
> > > I should also add a documentation patch for stallwarn.rst to document
> > > the understandable sensitivity of RCU stall detection to jiffies
> > > updates (or lack thereof). Or if you have time, I'd appreciate support
> > > on such a patch (not mandatory but I thought it would not hurt to
> > > ask).
> > >
> > > Looking forward to how your testing goes as well!
> > I have tested it, and it works for KGDB.
>
> Thanks! If you don't mind, I will add your Tested-by tag to the patch
> and send it out soon. My tests also look good!
You can add my Tested-by, but Reported-by should be "Binbin Zhou
<[email protected]>"

Huacai
>
>
> - Joel
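
On the struct-layout point raised earlier in the thread (an 'int' placed
between two 'long' fields), the following small sketch shows the padding
effect. The two structs and their field names are hypothetical and are not
the real rcu_state layout; they only assume that some other int-sized field
is available to sit next to the new counter. The sizes in the comments
assume a typical LP64 ABI with 8-byte, 8-byte-aligned 'unsigned long'.

```c
#include <stddef.h>

/* Hypothetical layout: each int is followed by 4 bytes of padding on
 * LP64 so that the next unsigned long stays 8-byte aligned. */
struct layout_padded {
	int other_int;
	unsigned long jiffies_a;
	int nr_fqs_jiffies_stall;
	unsigned long jiffies_b;
};

/* Same fields, with the two ints adjacent so they share one 8-byte slot. */
struct layout_packed {
	int other_int;
	int nr_fqs_jiffies_stall;
	unsigned long jiffies_a;
	unsigned long jiffies_b;
};

/* Bytes saved by the reordering (8 on a typical LP64 target). */
static size_t layout_savings(void)
{
	return sizeof(struct layout_padded) - sizeof(struct layout_packed);
}
```

On a real kernel build, pahole on the object file is the usual way to confirm
where the holes actually are before moving fields around.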