On Fri, Sep 17, 2021 at 11:34:06PM +0200, Guillaume Morin wrote:
> On 17 Sep 14:11, Paul E. McKenney wrote:
> > On Fri, Sep 17, 2021 at 09:15:57PM +0200, Guillaume Morin wrote:
> > > Hello Paul,
> > >
> > > I've been researching some RCU warnings we see that lead to full lockups
> > > with longterm 5.x kernels.
> > >
> > > Basically the rcu_advance_cbs() == true warning in
> > > rcu_advance_cbs_nowake() is firing then everything eventually gets
> > > stuck on RCU synchronization because the GP thread stays asleep while
> > > rcu_state.gp_flags & 1 == 1 (this is a bunch of nohz_full cpus)
> > >
> > > During that search I found your patch from July 12th
> > > https://www.spinics.net/lists/rcu/msg05731.html that seems related (all
> > > warnings we've seen happened in the __fput call path). Is there a reason
> > > this patch was not pushed? Is there an issue with this patch or did it
> > > fall just through the cracks?
> >
> > It is still in -rcu:
> >
> > 2431774f04d1 ("rcu: Mark accesses to rcu_state.n_force_qs")
> >
> > It is slated for the v5.16 merge window. But does it really fix the
> > problem that you are seeing?
>
> I am going to try it soon. Since I could not see it in Linus' tree, I
> wanted to make sure there was nothing wrong with the patch, hence my
> email :-)
>
> To my dismay, I can't reproduce this issue so this has made debugging
> and testing very complicated.
Welcome to my world! ;-)
> I have a few kdumps from 5.4 and 5.10 kernels (that's how I was able to
> observe that the gp thread was sleeping for a long time) and that
> rcu_state.gp_flags & 1 == 1.
>
> But this warning has happened a couple of dozen times on multiple
> machines in the __fput path (different kind of HW as well). Removing
> nohz_full from the command line makes the problem disappear.
>
> Most machines have had fairly long uptime (30+ days) before showing the
> warning, though it has happened on a couple occasions only after a few
> hours.
>
> That's pretty much all I have been able to gather so far, unfortunately.
What are these systems doing? Running mostly in nohz_full usermode?
Mostly idle? Something else?
If it happens again, could you please also capture the state of the
various rcuo kthreads? Of these, the rcuog kthreads start grace
periods and the rcuoc kthreads invoke callbacks.
> > > PS: FYI during my research, I've found another similar report in
> > > bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=208685
> >
> > Huh. First I have heard of it. It looks like they hit this after about
> > nine days of uptime. I have run way more than nine days of testing of
> > nohz_full RCU operation with rcutorture, and have never seen it myself.
> >
> > Can you reproduce this? If so, can you reproduce it on mainline kernels
> > (as opposed to -stable kernels as in that bugzilla)?
>
> I have at least one prod machine where the problem happens usually
> within a couple of days. All my attempts to reproduce on any testing
> environment have failed.
Again, welcome to my world!
> > The theory behind that WARN_ON_ONCE() is as follows:
> >
> > o The check of rcu_seq_state(rcu_seq_current(&rnp->gp_seq))
> > says that there is a grace period either in effect or just
> > now ending.
> >
> > o In the latter case, the grace-period cleanup has not yet
> > reached the current rcu_node structure, which means that
> > it has not yet checked to see if another grace period
> > is needed.
> >
> > o Either way, the RCU_GP_FLAG_INIT will cause the next grace
> > period to start. (This flag is protected by the root
> > rcu_node structure's ->lock.)
> >
> > Again, can you reproduce this, especially in mainline?
>
> I have not tried because running a mainline kernel in our prod
> enviroment is quite difficult and requires lot of work for validation.
> Though I could probably make it happen but it would take some time.
> Patches that I can apply on a stable kernel are much easier for me to
> try, as you probably have guessed.
OK, please see below. This is a complete shot in the dark, but could
potentially prevent the problem. Or make it worse, which would at the
very least speed up debugging. It might needs a bit of adjustment to
apply to the -stable kernels, but at first glance should apply cleanly.
Oh, and FYI I am having to manually paste your email address into the To:
line in order to get this to go back to you. Please check your email
configuration.
Which might mean that you need to pull this from my -rcu tree here:
1a792b59071b ("EXP rcu: Tighten rcu_advance_cbs_nowake() checks")
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6a1e9d3374db..6d692a591f66 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
struct rcu_data *rdp)
{
rcu_lockdep_assert_cblist_protected(rdp);
- if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
+ // Don't do anything unless the current grace period is guaranteed
+ // not to end. This means a grace period in progress and at least
+ // one holdout CPU.
+ if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
!raw_spin_trylock_rcu_node(rnp))
return;
- WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
+ if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
+ WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
raw_spin_unlock_rcu_node(rnp);
}
On 17 Sep 15:07, Paul E. McKenney wrote:
> > I have a few kdumps from 5.4 and 5.10 kernels (that's how I was able to
> > observe that the gp thread was sleeping for a long time) and that
> > rcu_state.gp_flags & 1 == 1.
> >
> > But this warning has happened a couple of dozen times on multiple
> > machines in the __fput path (different kind of HW as well). Removing
> > nohz_full from the command line makes the problem disappear.
> >
> > Most machines have had fairly long uptime (30+ days) before showing the
> > warning, though it has happened on a couple occasions only after a few
> > hours.
> >
> > That's pretty much all I have been able to gather so far, unfortunately.
>
> What are these systems doing? Running mostly in nohz_full usermode?
> Mostly idle? Something else?
Running mostly in nohz_full usermode (non preempt), mostly busy but
it varies. I don't think I've seen this warning on a idle machine
though.
> If it happens again, could you please also capture the state of the
> various rcuo kthreads? Of these, the rcuog kthreads start grace
> periods and the rcuoc kthreads invoke callbacks.
You mean the task state? Or something else I can dig up from a kdump?
This one was taken about 32:24s after the warning happened.
crash> ps -m | grep rcu
[0 00:00:26.697] [IN] PID: 89 TASK: ffff93c940b60000 CPU: 0 COMMAND: "rcuog/12"
[0 00:00:30.443] [IN] PID: 114 TASK: ffff93c940c623c0 CPU: 0 COMMAND: "rcuog/16"
[0 00:00:30.483] [IN] PID: 20 TASK: ffff93c940920000 CPU: 0 COMMAND: "rcuog/1"
[0 00:00:30.490] [IN] PID: 64 TASK: ffff93c940a9c780 CPU: 0 COMMAND: "rcuog/8"
[0 00:00:31.373] [IN] PID: 39 TASK: ffff93c9409aa3c0 CPU: 0 COMMAND: "rcuog/4"
[0 00:32:24.007] [IN] PID: 58 TASK: ffff93c940a6c780 CPU: 0 COMMAND: "rcuos/7"
[0 00:32:24.007] [ID] PID: 12 TASK: ffff93c940854780 CPU: 0 COMMAND: "rcu_sched"
[0 00:32:24.080] [IN] PID: 27 TASK: ffff93c94094a3c0 CPU: 0 COMMAND: "rcuos/2"
[0 00:32:24.090] [IN] PID: 83 TASK: ffff93c940b38000 CPU: 0 COMMAND: "rcuos/11"
[0 00:32:24.200] [IN] PID: 115 TASK: ffff93c940c64780 CPU: 0 COMMAND: "rcuos/16"
[0 00:32:24.250] [IN] PID: 40 TASK: ffff93c9409ac780 CPU: 0 COMMAND: "rcuos/4"
[0 00:32:24.973] [IN] PID: 65 TASK: ffff93c940ab0000 CPU: 0 COMMAND: "rcuos/8"
[0 00:32:24.973] [IN] PID: 46 TASK: ffff93c9409d4780 CPU: 0 COMMAND: "rcuos/5"
[0 00:32:28.197] [IN] PID: 77 TASK: ffff93c940b08000 CPU: 0 COMMAND: "rcuos/10"
[0 00:39:04.800] [IN] PID: 52 TASK: ffff93c940a44780 CPU: 0 COMMAND: "rcuos/6"
[0 00:39:04.850] [IN] PID: 33 TASK: ffff93c94097a3c0 CPU: 0 COMMAND: "rcuos/3"
[0 02:36:51.923] [IN] PID: 102 TASK: ffff93c940bfa3c0 CPU: 0 COMMAND: "rcuos/14"
[0 04:21:46.806] [IN] PID: 121 TASK: ffff93c940c8c780 CPU: 0 COMMAND: "rcuos/17"
[0 04:21:46.806] [IN] PID: 108 TASK: ffff93c940c323c0 CPU: 0 COMMAND: "rcuos/15"
[0 04:25:49.033] [IN] PID: 21 TASK: ffff93c9409223c0 CPU: 0 COMMAND: "rcuos/1"
[0 04:25:49.033] [IN] PID: 96 TASK: ffff93c940bd23c0 CPU: 0 COMMAND: "rcuos/13"
[0 05:12:14.289] [IN] PID: 71 TASK: ffff93c940ad8000 CPU: 0 COMMAND: "rcuos/9"
[0 05:12:17.849] [IN] PID: 90 TASK: ffff93c940b623c0 CPU: 0 COMMAND: "rcuos/12"
[0 05:18:39.813] [IN] PID: 10 TASK: ffff93c940850000 CPU: 0 COMMAND: "rcu_tasks_trace"
[0 05:18:39.813] [IN] PID: 9 TASK: ffff93c940844780 CPU: 0 COMMAND: "rcu_tasks_rude_"
[0 05:18:39.813] [ID] PID: 4 TASK: ffff93c940828000 CPU: 0 COMMAND: "rcu_par_gp"
[0 05:18:39.813] [ID] PID: 3 TASK: ffff93c940804780 CPU: 0 COMMAND: "rcu_gp"
> OK, please see below. This is a complete shot in the dark, but could
> potentially prevent the problem. Or make it worse, which would at the
> very least speed up debugging. It might needs a bit of adjustment to
> apply to the -stable kernels, but at first glance should apply cleanly.
I can adjust, that's not a problem. But to be clear you'd rather have me
apply this instead of the other patch I mentioned
(https://www.spinics.net/lists/rcu/msg05731.html) or you're okay with me
trying with both applied?
> Oh, and FYI I am having to manually paste your email address into the To:
> line in order to get this to go back to you. Please check your email
> configuration.
Hmm I've adjusted the Reply-To. Let me know if it's better.
Guillaume.
--
Guillaume Morin <[email protected]>
On Sat, Sep 18, 2021 at 02:39:35AM +0200, Guillaume Morin wrote:
> On 17 Sep 15:07, Paul E. McKenney wrote:
> > > I have a few kdumps from 5.4 and 5.10 kernels (that's how I was able to
> > > observe that the gp thread was sleeping for a long time) and that
> > > rcu_state.gp_flags & 1 == 1.
> > >
> > > But this warning has happened a couple of dozen times on multiple
> > > machines in the __fput path (different kind of HW as well). Removing
> > > nohz_full from the command line makes the problem disappear.
> > >
> > > Most machines have had fairly long uptime (30+ days) before showing the
> > > warning, though it has happened on a couple occasions only after a few
> > > hours.
> > >
> > > That's pretty much all I have been able to gather so far, unfortunately.
> >
> > What are these systems doing? Running mostly in nohz_full usermode?
> > Mostly idle? Something else?
>
> Running mostly in nohz_full usermode (non preempt), mostly busy but
> it varies. I don't think I've seen this warning on a idle machine
> though.
OK, good to know.
> > If it happens again, could you please also capture the state of the
> > various rcuo kthreads? Of these, the rcuog kthreads start grace
> > periods and the rcuoc kthreads invoke callbacks.
>
> You mean the task state? Or something else I can dig up from a kdump?
>
> This one was taken about 32:24s after the warning happened.
>
> crash> ps -m | grep rcu
> [0 00:00:26.697] [IN] PID: 89 TASK: ffff93c940b60000 CPU: 0 COMMAND: "rcuog/12"
> [0 00:00:30.443] [IN] PID: 114 TASK: ffff93c940c623c0 CPU: 0 COMMAND: "rcuog/16"
> [0 00:00:30.483] [IN] PID: 20 TASK: ffff93c940920000 CPU: 0 COMMAND: "rcuog/1"
> [0 00:00:30.490] [IN] PID: 64 TASK: ffff93c940a9c780 CPU: 0 COMMAND: "rcuog/8"
> [0 00:00:31.373] [IN] PID: 39 TASK: ffff93c9409aa3c0 CPU: 0 COMMAND: "rcuog/4"
> [0 00:32:24.007] [IN] PID: 58 TASK: ffff93c940a6c780 CPU: 0 COMMAND: "rcuos/7"
> [0 00:32:24.007] [ID] PID: 12 TASK: ffff93c940854780 CPU: 0 COMMAND: "rcu_sched"
> [0 00:32:24.080] [IN] PID: 27 TASK: ffff93c94094a3c0 CPU: 0 COMMAND: "rcuos/2"
> [0 00:32:24.090] [IN] PID: 83 TASK: ffff93c940b38000 CPU: 0 COMMAND: "rcuos/11"
> [0 00:32:24.200] [IN] PID: 115 TASK: ffff93c940c64780 CPU: 0 COMMAND: "rcuos/16"
> [0 00:32:24.250] [IN] PID: 40 TASK: ffff93c9409ac780 CPU: 0 COMMAND: "rcuos/4"
> [0 00:32:24.973] [IN] PID: 65 TASK: ffff93c940ab0000 CPU: 0 COMMAND: "rcuos/8"
> [0 00:32:24.973] [IN] PID: 46 TASK: ffff93c9409d4780 CPU: 0 COMMAND: "rcuos/5"
> [0 00:32:28.197] [IN] PID: 77 TASK: ffff93c940b08000 CPU: 0 COMMAND: "rcuos/10"
> [0 00:39:04.800] [IN] PID: 52 TASK: ffff93c940a44780 CPU: 0 COMMAND: "rcuos/6"
> [0 00:39:04.850] [IN] PID: 33 TASK: ffff93c94097a3c0 CPU: 0 COMMAND: "rcuos/3"
> [0 02:36:51.923] [IN] PID: 102 TASK: ffff93c940bfa3c0 CPU: 0 COMMAND: "rcuos/14"
> [0 04:21:46.806] [IN] PID: 121 TASK: ffff93c940c8c780 CPU: 0 COMMAND: "rcuos/17"
> [0 04:21:46.806] [IN] PID: 108 TASK: ffff93c940c323c0 CPU: 0 COMMAND: "rcuos/15"
> [0 04:25:49.033] [IN] PID: 21 TASK: ffff93c9409223c0 CPU: 0 COMMAND: "rcuos/1"
> [0 04:25:49.033] [IN] PID: 96 TASK: ffff93c940bd23c0 CPU: 0 COMMAND: "rcuos/13"
> [0 05:12:14.289] [IN] PID: 71 TASK: ffff93c940ad8000 CPU: 0 COMMAND: "rcuos/9"
> [0 05:12:17.849] [IN] PID: 90 TASK: ffff93c940b623c0 CPU: 0 COMMAND: "rcuos/12"
> [0 05:18:39.813] [IN] PID: 10 TASK: ffff93c940850000 CPU: 0 COMMAND: "rcu_tasks_trace"
> [0 05:18:39.813] [IN] PID: 9 TASK: ffff93c940844780 CPU: 0 COMMAND: "rcu_tasks_rude_"
> [0 05:18:39.813] [ID] PID: 4 TASK: ffff93c940828000 CPU: 0 COMMAND: "rcu_par_gp"
> [0 05:18:39.813] [ID] PID: 3 TASK: ffff93c940804780 CPU: 0 COMMAND: "rcu_gp"
That is them! There are some flags that control their activities:
o rcu_data structure's ->nocb_gp_sleep field (rcuog)
o rcu_data structure's ->nocb_cb_sleep field (rcuoc)
> > OK, please see below. This is a complete shot in the dark, but could
> > potentially prevent the problem. Or make it worse, which would at the
> > very least speed up debugging. It might needs a bit of adjustment to
> > apply to the -stable kernels, but at first glance should apply cleanly.
>
> I can adjust, that's not a problem. But to be clear you'd rather have me
> apply this instead of the other patch I mentioned
> (https://www.spinics.net/lists/rcu/msg05731.html) or you're okay with me
> trying with both applied?
Trying both is fine.
> > Oh, and FYI I am having to manually paste your email address into the To:
> > line in order to get this to go back to you. Please check your email
> > configuration.
>
> Hmm I've adjusted the Reply-To. Let me know if it's better.
Much better. Still a bit unusual in that it puts everything on the
To: line instead of using Cc: as well, but close enough. Thank you!
Thanx, Paul
On 17 Sep 21:00, Paul E. McKenney wrote:
> That is them! There are some flags that control their activities:
>
> o rcu_data structure's ->nocb_gp_sleep field (rcuog)
> o rcu_data structure's ->nocb_cb_sleep field (rcuoc)
From the same kdump:
crash> pd rcu_data:all | grep -E 'nocb_cb_sleep|nocb_gp_sleep|per_cpu'
per_cpu(rcu_data, 0) = $69 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = false,
per_cpu(rcu_data, 1) = $70 = {
nocb_gp_sleep = 1 '\001',
nocb_cb_sleep = true,
per_cpu(rcu_data, 2) = $71 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 3) = $72 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 4) = $73 = {
nocb_gp_sleep = 1 '\001',
nocb_cb_sleep = true,
per_cpu(rcu_data, 5) = $74 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 6) = $75 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 7) = $76 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 8) = $77 = {
nocb_gp_sleep = 1 '\001',
nocb_cb_sleep = true,
per_cpu(rcu_data, 9) = $78 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 10) = $79 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 11) = $80 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 12) = $81 = {
nocb_gp_sleep = 1 '\001',
nocb_cb_sleep = true,
per_cpu(rcu_data, 13) = $82 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 14) = $83 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 15) = $84 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
per_cpu(rcu_data, 16) = $85 = {
nocb_gp_sleep = 1 '\001',
nocb_cb_sleep = true,
per_cpu(rcu_data, 17) = $86 = {
nocb_gp_sleep = 0 '\000',
nocb_cb_sleep = true,
crash>
--
Guillaume Morin <[email protected]>
On Sat, Sep 18, 2021 at 09:08:38AM +0200, Guillaume Morin wrote:
> On 17 Sep 21:00, Paul E. McKenney wrote:
> > That is them! There are some flags that control their activities:
> >
> > o rcu_data structure's ->nocb_gp_sleep field (rcuog)
> > o rcu_data structure's ->nocb_cb_sleep field (rcuoc)
>
> >From the same kdump:
>
> crash> pd rcu_data:all | grep -E 'nocb_cb_sleep|nocb_gp_sleep|per_cpu'
> per_cpu(rcu_data, 0) = $69 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = false,
> per_cpu(rcu_data, 1) = $70 = {
> nocb_gp_sleep = 1 '\001',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 2) = $71 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 3) = $72 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 4) = $73 = {
> nocb_gp_sleep = 1 '\001',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 5) = $74 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 6) = $75 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 7) = $76 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 8) = $77 = {
> nocb_gp_sleep = 1 '\001',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 9) = $78 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 10) = $79 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 11) = $80 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 12) = $81 = {
> nocb_gp_sleep = 1 '\001',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 13) = $82 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 14) = $83 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 15) = $84 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 16) = $85 = {
> nocb_gp_sleep = 1 '\001',
> nocb_cb_sleep = true,
> per_cpu(rcu_data, 17) = $86 = {
> nocb_gp_sleep = 0 '\000',
> nocb_cb_sleep = true,
> crash>
This is consistent with CPU 0's rcuoc kthread processing callbacks
and all of the rcuog threads waiting for more callbacks.
How is the testing of the patches going? (I am guessing nothing yet
based on the failure times, but who knows?)
Thanx, Paul
On 19 Sep 9:35, Paul E. McKenney wrote:
> How is the testing of the patches going? (I am guessing nothing yet
> based on the failure times, but who knows?)
Nothing yet. I think we'll have a better idea by wednesday.
Guillaume.
--
Guillaume Morin <[email protected]>
On 20 Sep 18:05, Guillaume Morin wrote:
> On 19 Sep 9:35, Paul E. McKenney wrote:
> > How is the testing of the patches going? (I am guessing nothing yet
> > based on the failure times, but who knows?)
>
> Nothing yet. I think we'll have a better idea by wednesday.
I am little afraid of jinxing it :) but so far so good. I have the a
patched kernel running on a few machines (including my most "reliable
crasher") and they've been stable so far.
It's definitely too early to declare victory though. I will keep you
posted.
Guillaume.
--
Guillaume Morin <[email protected]>
On Wed, Sep 22, 2021 at 09:14:07PM +0200, Guillaume Morin wrote:
> On 20 Sep 18:05, Guillaume Morin wrote:
> > On 19 Sep 9:35, Paul E. McKenney wrote:
> > > How is the testing of the patches going? (I am guessing nothing yet
> > > based on the failure times, but who knows?)
> >
> > Nothing yet. I think we'll have a better idea by wednesday.
>
> I am little afraid of jinxing it :) but so far so good. I have the a
> patched kernel running on a few machines (including my most "reliable
> crasher") and they've been stable so far.
>
> It's definitely too early to declare victory though. I will keep you
> posted.
Here is hoping! ;-)
Thanx, Paul
On 22 Sep 12:24, Paul E. McKenney wrote:
> On Wed, Sep 22, 2021 at 09:14:07PM +0200, Guillaume Morin wrote:
> > I am little afraid of jinxing it :) but so far so good. I have the a
> > patched kernel running on a few machines (including my most "reliable
> > crasher") and they've been stable so far.
> >
> > It's definitely too early to declare victory though. I will keep you
> > posted.
>
> Here is hoping! ;-)
Things are still stable. So I am pretty optimistic. How are you planning
to proceeed?
The first patch is already in your rcu tree and my gut feeling is that
it is the one that fixes the issue but you're the expert here... Though
I think it should be probably fast tracked and marked for stable?
Are you planning on committing the 2nd patch to your tree?
--
Guillaume Morin <[email protected]>
On Mon, Sep 27, 2021 at 05:38:42PM +0200, Guillaume Morin wrote:
> On 22 Sep 12:24, Paul E. McKenney wrote:
> > On Wed, Sep 22, 2021 at 09:14:07PM +0200, Guillaume Morin wrote:
> > > I am little afraid of jinxing it :) but so far so good. I have the a
> > > patched kernel running on a few machines (including my most "reliable
> > > crasher") and they've been stable so far.
> > >
> > > It's definitely too early to declare victory though. I will keep you
> > > posted.
> >
> > Here is hoping! ;-)
>
> Things are still stable. So I am pretty optimistic. How are you planning
> to proceeed?
Very good! Would you be willing to give me your Tested-by?
> The first patch is already in your rcu tree and my gut feeling is that
> it is the one that fixes the issue but you're the expert here... Though
> I think it should be probably fast tracked and marked for stable?
>
> Are you planning on committing the 2nd patch to your tree?
This is the second patch, correct? (Too many patches!)
If so, I add your Tested-by and fill out the commit log. It would be
slated for the v5.17 merge window by default, that is, not the upcoming
merge window but the one after that. Please let me know if you need
it sooner.
Thanx, Paul
------------------------------------------------------------------------
commit 1a792b59071b697defd4ccdc8b951cce49de9d2f
Author: Paul E. McKenney <[email protected]>
Date: Fri Sep 17 15:04:48 2021 -0700
EXP rcu: Tighten rcu_advance_cbs_nowake() checks
This is an experimental shot-in-the-dark debugging patch.
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6a1e9d3374db..6d692a591f66 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
struct rcu_data *rdp)
{
rcu_lockdep_assert_cblist_protected(rdp);
- if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
+ // Don't do anything unless the current grace period is guaranteed
+ // not to end. This means a grace period in progress and at least
+ // one holdout CPU.
+ if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
!raw_spin_trylock_rcu_node(rnp))
return;
- WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
+ if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
+ WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
raw_spin_unlock_rcu_node(rnp);
}
On 27 Sep 9:10, Paul E. McKenney wrote:
> Very good! Would you be willing to give me your Tested-by?
Of course! Added below after the patch.
> > The first patch is already in your rcu tree and my gut feeling is that
> > it is the one that fixes the issue but you're the expert here... Though
> > I think it should be probably fast tracked and marked for stable?
> >
> > Are you planning on committing the 2nd patch to your tree?
>
> This is the second patch, correct? (Too many patches!)
Correct. And to be 100% clear, the first one is
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?id=2431774f04d1050292054c763070021bade7b151
> If so, I add your Tested-by and fill out the commit log. It would be
> slated for the v5.17 merge window by default, that is, not the upcoming
> merge window but the one after that. Please let me know if you need
> it sooner.
I personally don't need it sooner. But it's been broken for a while (5.4
based on the bugzilla report) and I can't imagine the original reporter
and we are the only ones hitting this. So my personal opinion would be
to get both patches into Linus's tree and stable branches asap, but
obviously this is entirely up to you.
I do appreciate the help!
> ------------------------------------------------------------------------
>
> commit 1a792b59071b697defd4ccdc8b951cce49de9d2f
> Author: Paul E. McKenney <[email protected]>
> Date: Fri Sep 17 15:04:48 2021 -0700
>
> EXP rcu: Tighten rcu_advance_cbs_nowake() checks
>
> This is an experimental shot-in-the-dark debugging patch.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6a1e9d3374db..6d692a591f66 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
> struct rcu_data *rdp)
> {
> rcu_lockdep_assert_cblist_protected(rdp);
> - if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
> + // Don't do anything unless the current grace period is guaranteed
> + // not to end. This means a grace period in progress and at least
> + // one holdout CPU.
> + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
> !raw_spin_trylock_rcu_node(rnp))
> return;
> - WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> + if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
> + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> raw_spin_unlock_rcu_node(rnp);
> }
>
Tested-By: Guillaume Morin <[email protected]>
--
Guillaume Morin <[email protected]>
On Mon, Sep 27, 2021 at 06:49:45PM +0200, Guillaume Morin wrote:
> On 27 Sep 9:10, Paul E. McKenney wrote:
> > Very good! Would you be willing to give me your Tested-by?
>
> Of course! Added below after the patch.
>
> > > The first patch is already in your rcu tree and my gut feeling is that
> > > it is the one that fixes the issue but you're the expert here... Though
> > > I think it should be probably fast tracked and marked for stable?
> > >
> > > Are you planning on committing the 2nd patch to your tree?
> >
> > This is the second patch, correct? (Too many patches!)
>
> Correct. And to be 100% clear, the first one is
> https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?id=2431774f04d1050292054c763070021bade7b151
>
> > If so, I add your Tested-by and fill out the commit log. It would be
> > slated for the v5.17 merge window by default, that is, not the upcoming
> > merge window but the one after that. Please let me know if you need
> > it sooner.
>
> I personally don't need it sooner. But it's been broken for a while (5.4
> based on the bugzilla report) and I can't imagine the original reporter
> and we are the only ones hitting this. So my personal opinion would be
> to get both patches into Linus's tree and stable branches asap, but
> obviously this is entirely up to you.
I have it in -next with your Tested-by (thank you!), so let's see how
testing and review go.
> I do appreciate the help!
And thank you for giving that patch a go!
Thanx, Paul
> > ------------------------------------------------------------------------
> >
> > commit 1a792b59071b697defd4ccdc8b951cce49de9d2f
> > Author: Paul E. McKenney <[email protected]>
> > Date: Fri Sep 17 15:04:48 2021 -0700
> >
> > EXP rcu: Tighten rcu_advance_cbs_nowake() checks
> >
> > This is an experimental shot-in-the-dark debugging patch.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6a1e9d3374db..6d692a591f66 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
> > struct rcu_data *rdp)
> > {
> > rcu_lockdep_assert_cblist_protected(rdp);
> > - if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
> > + // Don't do anything unless the current grace period is guaranteed
> > + // not to end. This means a grace period in progress and at least
> > + // one holdout CPU.
> > + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
> > !raw_spin_trylock_rcu_node(rnp))
> > return;
> > - WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> > + if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
> > + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> > raw_spin_unlock_rcu_node(rnp);
> > }
> >
>
> Tested-By: Guillaume Morin <[email protected]>
>
> --
> Guillaume Morin <[email protected]>
On 27 Sep 14:46, Paul E. McKenney wrote:
> On Mon, Sep 27, 2021 at 06:49:45PM +0200, Guillaume Morin wrote:
> > I personally don't need it sooner. But it's been broken for a while (5.4
> > based on the bugzilla report) and I can't imagine the original reporter
> > and we are the only ones hitting this. So my personal opinion would be
> > to get both patches into Linus's tree and stable branches asap, but
> > obviously this is entirely up to you.
>
> I have it in -next with your Tested-by (thank you!), so let's see how
> testing and review go.
Sounds good. Things are still stable so I am pretty confident this is
now fixed.
Guillaume.
--
Guillaume Morin <[email protected]>
On Fri, 17 Sep 2021 15:07:00 -0700, Paul E. McKenney wrote:
> OK, please see below. This is a complete shot in the dark, but could
> potentially prevent the problem. Or make it worse, which would at the
> very least speed up debugging. It might needs a bit of adjustment to
> apply to the -stable kernels, but at first glance should apply cleanly.
>
> Oh, and FYI I am having to manually paste your email address into the To:
> line in order to get this to go back to you. Please check your email
> configuration.
>
> Which might mean that you need to pull this from my -rcu tree here:
>
> 1a792b59071b ("EXP rcu: Tighten rcu_advance_cbs_nowake() checks")
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6a1e9d3374db..6d692a591f66 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
> struct rcu_data *rdp)
> {
> rcu_lockdep_assert_cblist_protected(rdp);
> - if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
> + // Don't do anything unless the current grace period is guaranteed
> + // not to end. This means a grace period in progress and at least
> + // one holdout CPU.
> + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
> !raw_spin_trylock_rcu_node(rnp))
> return;
> - WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> + if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
> + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> raw_spin_unlock_rcu_node(rnp);
> }
>
Hello Paul,
We've received a few reports of this warning. Reviewing the code I don't really
see any reason for the READ_ONCE(rnp->qsmask) part here and hence I started
tracing the data before applying the patch to see the actual values before
and after the lock is acquired to better understand the situation.
This can be done with a short bash script:
~~~
perf probe 'prelock1=rcu_advance_cbs_nowake+0x29 gp_seq=%ax:x64 rnp->qsmask rnp->lock' # gp_seq from register after the condition check so this one will always be &3!=0
perf probe 'prelock2=rcu_advance_cbs_nowake+0x2c rnp->gp_seq rnp->qsmask rnp->lock' # gp_seq refetched from memory. it could already be &0x3==0
perf probe 'acquired=rcu_advance_cbs_nowake+0x35 rnp->gp_seq rnp->qsmask rnp->lock' # gp_seq refetched again after taking the lock, ditto - which is bug
perf probe 'warning_=rcu_advance_cbs_nowake+0x40 rnp->gp_seq rnp->qsmask rnp->lock condition=%ax:s8' # 'condition' is the return value from rcu_advance_cbs() call
trace-cmd stream \
-e probe:prelock1 \
-e probe:prelock2 -f '!gp_seq&3' \
-e probe:acquired -f '!gp_seq&3' \
-e probe:warning_ -f condition==1
~~~
The best part is that adding the kprobes opened the race window so that with
the tracing enabled I could reproduce the bug in matter of seconds on my VM.
One 'top' on an idle system is enough to hit it, though to accelerate I was
using a bunch of them (but still just enough so that the machine remains
mostly idle - the VM has 8 vCPUs):
# for i in {1..40}; do top -b -d 0.1 >/dev/null & done # kill %{1..40}
Note, that 'rcu_nocbs=1-7' kernel option needs to be used otherwise
rcu_advance_cbs_nowake() is not even beeing called at all as well as there
are no offload threads it could race with.
The results show that indeed (confirming the code review) the node qsmask can
be zero while still there is no warning and no subsequent stall. As long as
rcu_seq_state(...) is true, everything is fine.
Only when the GP state check is true before taking the lock and false after
acquiring it is when rcu_advance_cbs() returns with true and the system is
doomed (with the warning warmly announcing it) as the 'rcu_sched' thread is
never woken again. The system will eventually run out of memory or the tasks
get blocked on synchronize_rcu() indefinitely.
With this observation I was confident enough to finally apply just the grace
period part of your patch (below). After that the system survived 12 hours
over night. Since I could reproduce in matter of seconds before I call it a
success.
So what is your opinion about the quiescent state mask part? Is it needed or
rather really redundant? Perhaps upstream differs to RHEL kernel but on RHEL
I don't really see the need and the below patch is sufficient IMO.
Or perhaps I'm missing the part where the qsmask check is not really needed
but it's just an optimization because in that case we do not need to advance
the callbacks here as they will be advanced soon anyways?
With or without the qsmask part, in both cases I believe this should go to
stable 5.4+ and of course we want it in RHEL asap, so once Linus merges a
version of it, we are going to backport. Since this is only reproducible
with the 'rcu_nocbs' option I understand that the v5.17 merge window is
a reasonable target for upstream. Nevertheless this is still a bugfix.
--nX
----
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1aebb2dfbf90..96df7f68ff4d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1389,7 +1389,8 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
!raw_spin_trylock_rcu_node(rnp))
return;
- WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
+ if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
+ WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
raw_spin_unlock_rcu_node(rnp);
}
--
On Thu, Nov 18, 2021 at 07:41:28PM +0100, Daniel Vacek wrote:
> On Fri, 17 Sep 2021 15:07:00 -0700, Paul E. McKenney wrote:
> > OK, please see below. This is a complete shot in the dark, but could
> > potentially prevent the problem. Or make it worse, which would at the
> > very least speed up debugging. It might needs a bit of adjustment to
> > apply to the -stable kernels, but at first glance should apply cleanly.
> >
> > Oh, and FYI I am having to manually paste your email address into the To:
> > line in order to get this to go back to you. Please check your email
> > configuration.
> >
> > Which might mean that you need to pull this from my -rcu tree here:
> >
> > 1a792b59071b ("EXP rcu: Tighten rcu_advance_cbs_nowake() checks")
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6a1e9d3374db..6d692a591f66 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1590,10 +1590,14 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
> > struct rcu_data *rdp)
> > {
> > rcu_lockdep_assert_cblist_protected(rdp);
> > - if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
> > + // Don't do anything unless the current grace period is guaranteed
> > + // not to end. This means a grace period in progress and at least
> > + // one holdout CPU.
> > + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !READ_ONCE(rnp->qsmask) ||
> > !raw_spin_trylock_rcu_node(rnp))
> > return;
> > - WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> > + if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) && READ_ONCE(rnp->qsmask))
> > + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> > raw_spin_unlock_rcu_node(rnp);
> > }
> >
>
> Hello Paul,
>
> We've received a few reports of this warning. Reviewing the code I don't really
> see any reason for the READ_ONCE(rnp->qsmask) part here and hence I started
> tracing the data before applying the patch to see the actual values before
> and after the lock is acquired to better understand the situation.
>
> This can be done with a short bash script:
>
> ~~~
> perf probe 'prelock1=rcu_advance_cbs_nowake+0x29 gp_seq=%ax:x64 rnp->qsmask rnp->lock' # gp_seq from register after the condition check so this one will always be &3!=0
> perf probe 'prelock2=rcu_advance_cbs_nowake+0x2c rnp->gp_seq rnp->qsmask rnp->lock' # gp_seq refetched from memory. it could already be &0x3==0
> perf probe 'acquired=rcu_advance_cbs_nowake+0x35 rnp->gp_seq rnp->qsmask rnp->lock' # gp_seq refetched again after taking the lock, ditto - which is bug
> perf probe 'warning_=rcu_advance_cbs_nowake+0x40 rnp->gp_seq rnp->qsmask rnp->lock condition=%ax:s8' # 'condition' is the return value from rcu_advance_cbs() call
> trace-cmd stream \
> -e probe:prelock1 \
> -e probe:prelock2 -f '!gp_seq&3' \
> -e probe:acquired -f '!gp_seq&3' \
> -e probe:warning_ -f condition==1
> ~~~
>
> The best part is that adding the kprobes opened the race window so that with
> the tracing enabled I could reproduce the bug in matter of seconds on my VM.
> One 'top' on an idle system is enough to hit it, though to accelerate I was
> using a bunch of them (but still just enough so that the machine remains
> mostly idle - the VM has 8 vCPUs):
>
> # for i in {1..40}; do top -b -d 0.1 >/dev/null & done # kill %{1..40}
>
> Note, that 'rcu_nocbs=1-7' kernel option needs to be used otherwise
> rcu_advance_cbs_nowake() is not even beeing called at all as well as there
> are no offload threads it could race with.
Agreed.
> The results show that indeed (confirming the code review) the node qsmask can
> be zero while still there is no warning and no subsequent stall. As long as
> rcu_seq_state(...) is true, everything is fine.
>
> Only when the GP state check is true before taking the lock and false after
> acquiring it is when rcu_advance_cbs() returns with true and the system is
> doomed (with the warning warmly announcing it) as the 'rcu_sched' thread is
> never woken again. The system will eventually run out of memory or the tasks
> get blocked on synchronize_rcu() indefinitely.
>
> With this observation I was confident enough to finally apply just the grace
> period part of your patch (below). After that the system survived 12 hours
> over night. Since I could reproduce in matter of seconds before I call it a
> success.
>
> So what is your opinion about the quiescent state mask part? Is it needed or
> rather really redundant? Perhaps upstream differs to RHEL kernel but on RHEL
> I don't really see the need and the below patch is sufficient IMO.
Well, let's see...
We hold the rcu_node structure's ->lock, which means that if the grace
period is in progress (rcu_seq_state(rcu_seq_current(&rnp->gp_seq))),
we know that rcu_gp_cleanup() will be visiting that rcu_node structure.
When it does so, it will check rcu_future_gp_cleanup(). This function
looks at rnp->gp_seq_needed to see if more grace periods are required.
And this field will be updated by rcu_start_this_gp(), which is invoked
from rcu_accelerate_cbs() which is in turn invoked from rcu_advance_cbs().
And the return value from rcu_start_this_gp() is passed through by both
rcu_accelerate_cbs() and rcu_advance_cbs(). And rcu_start_this_gp()
is guaranteed to return false (thus avoiding triggering the assertion)
when a grace period is in progress. And, as noted above, because the
grace period cannot officially end while we are holding the ->lock of
an rcu_node structure that believes that a grace period is in progress,
we are guaranteed of that false return.
So good catch!
(My paranoia stemmed from the fact that there was a time when the
rcu_state strucure's idea of the grace period was updated before those
of the rcu_node structures. In case you were wondering.)
> Or perhaps I'm missing the part where the qsmask check is not really needed
> but it's just an optimization because in that case we do not need to advance
> the callbacks here as they will be advanced soon anyways?
Well, the lockless ->qsmask check could be considered an optimization
because it would reduce the probability of the grace period ending while
we were acquiring the lock. But that optimization is not likely to be
worth it.
> With or without the qsmask part, in both cases I believe this should go to
> stable 5.4+ and of course we want it in RHEL asap, so once Linus merges a
> version of it, we are going to backport. Since this is only reproducible
> with the 'rcu_nocbs' option I understand that the v5.17 merge window is
> a reasonable target for upstream. Nevertheless this is still a bugfix.
If I was sending it upstream as-is, it -might- be worth jamming into v5.16.
But I am updating that commit with attribution, so some additional time
testing is not a bad thing.
Either way, I believe you are good backporting this patch given suitable
additional testing, at your option (your distro, your rules!).
Thanx, Paul
> --nX
>
> ----
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 1aebb2dfbf90..96df7f68ff4d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1389,7 +1389,8 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
> if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
> !raw_spin_trylock_rcu_node(rnp))
> return;
> - WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> + if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
> + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
> raw_spin_unlock_rcu_node(rnp);
> }
>
> --
>