Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Thu, 13 Dec 2018 18:15:27 -0800
From:   "Paul E. McKenney" <paulmck@linux.ibm.com>
To:     "He, Bo" <bo.he@intel.com>
Cc:     "Zhang, Jun" <jun.zhang@intel.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "josh@joshtriplett.org" <josh@joshtriplett.org>,
        "mathieu.desnoyers@efficios.com" <mathieu.desnoyers@efficios.com>,
        "jiangshanlai@gmail.com" <jiangshanlai@gmail.com>,
        "Xiao, Jin" <jin.xiao@intel.com>,
        "Zhang, Yanmin" <yanmin.zhang@intel.com>,
        "Bai, Jie A" <jie.a.bai@intel.com>,
        "Sun, Yi J" <yi.j.sun@intel.com>
Subject: Re: rcu_preempt caused oom
Reply-To: paulmck@linux.ibm.com
References: <20181212210316.GA14777@linux.ibm.com>
 <CD6925E8781EFD4D8E11882D20FC406D52A193C3@SHSMSX104.ccr.corp.intel.com>
 <20181213001214.GE4170@linux.ibm.com>
 <88DC34334CA3444C85D647DBFA962C2735AD5F77@SHSMSX104.ccr.corp.intel.com>
 <20181213024234.GF4170@linux.ibm.com>
 <88DC34334CA3444C85D647DBFA962C2735AD5F9E@SHSMSX104.ccr.corp.intel.com>
 <20181213044020.GA19765@linux.ibm.com>
 <CD6925E8781EFD4D8E11882D20FC406D52A197EC@SHSMSX104.ccr.corp.intel.com>
 <20181213181136.GL4170@linux.ibm.com>
 <CD6925E8781EFD4D8E11882D20FC406D52A198C7@SHSMSX104.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CD6925E8781EFD4D8E11882D20FC406D52A198C7@SHSMSX104.ccr.corp.intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20181214021527.GR4170@linux.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double checked there is no FAST_NO_HZ in .config:

Yes, you are correct, CONFIG_RCU_FAST_NO_HZ.  OK, you do not have it set,
which means several code paths can be ignored.  Also CONFIG_HZ=1000, so
300 second delay.

							Thanx, Paul

> Here is the grep from .config:
> egrep "HZ|RCU" .config
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> # CONFIG_MACHZ_WDT is not set
> # RCU Debugging
> CONFIG_PROVE_RCU=y
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=7
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y
> 
> -----Original Message-----
> From: Paul E. McKenney <paulmck@linux.ibm.com> 
> Sent: Friday, December 14, 2018 2:12 AM
> To: He, Bo <bo.he@intel.com>
> Cc: Zhang, Jun <jun.zhang@intel.com>; Steven Rostedt <rostedt@goodmis.org>; linux-kernel@vger.kernel.org; josh@joshtriplett.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; Xiao, Jin <jin.xiao@intel.com>; Zhang, Yanmin <yanmin.zhang@intel.com>; Bai, Jie A <jie.a.bai@intel.com>; Sun, Yi J <yi.j.sun@intel.com>
> Subject: Re: rcu_preempt caused oom
> 
> On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > one of the board reproduce the issue with the show_rcu_gp_kthreads(), I also enclosed the logs as attachment.
> > 
> > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq       21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
> 
> This is quite helpful, thank you!
> 
> The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good.  The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request.  The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
> 
> The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period.  So if the rcu_preempt task were to wake up, it would process the new grace period.  Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
> 
> But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code).  So why is rcu_check_gp_start_stall() not being called?  Here are a couple of possibilities:
> 
> 1.	Because rcu_check_gp_start_stall() is only ever invoked from
> 	RCU_SOFTIRQ, it is possible that softirqs are stalled for
> 	whatever reason.
> 
> 2.	Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> 	interrupt handler, it is possible that the scheduler tick has
> 	somehow been disabled.  Traces from earlier runs showed a great
> 	deal of RCU callbacks queued, which would have caused RCU to
> 	refuse to allow the scheduler tick to be disabled, even if the
> 	corresponding CPU was idle.
> 
> 3.	You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> 	that you are building for a battery-powered device) and all of the
> 	CPU's callbacks are lazy.  Except that your earlier traces showed
> 	lots of non-lazy callbacks.  Besides, even if all callbacks were
> 	lazy, there would still be a scheduling-clock interrupt every
> 	six seconds, and there are quite a few six-second intervals
> 	in a two-minute watchdog timeout.
> 
> 	But if we cannot find the problem quickly, I will likely ask
> 	you to try reproducing with CONFIG_FAST_NO_HZ=n.  This could
> 	be thought of as bisecting the RCU code looking for the bug.
> 
> The first two of these seem unlikely given that the watchdog timer was still firing.  Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise.  Could you please check?
> One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall().  It might be necessary to no-inline rcu_check_gp_start_stall().  You might have better ways to collect this information.
> 
> Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
> 
> You do have a lockdep complaint early at boot.  Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed.  Sometimes the consequences of this sort of deadlock can propagate to unexepected places.
> 
> Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set.  Perhaps there is some code path that drops the wakeup.
> I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
> 
> The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup().  We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
> 
> The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp().  We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n.  Then rcu_accelerate_cbs() is invoked from:
> 
> o	rcu_accelerate_cbs_unlocked(), which does the following, thus
> 	properly awakening the rcu_preempt task when needed:
> 
> 	needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 	raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
> 
> o	rcu_advance_cbs(), which returns the value returned by
> 	rcu_accelerate_cbs(), thus pushing the problem off to its
> 	callers, which are called out below.
> 
> o	__note_gp_changes(), which also returns the value returned by
> 	rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> 	which are called out below.
> 
> o	rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> 	kthreads such as the rcu_preempt task.	Therefore, this function
> 	never needs to awaken the rcu_preempt task, because the fact
> 	that this function is executing means that this task is already
> 	awake.	(Also, as noted above, we can eliminate this code from
> 	consideration because this task is known to have been sleeping
> 	at the last time that the RCU_GP_FLAG_INIT bit was set.)
> 
> o	rcu_report_qs_rdp(), which does the following, thus properly
> 	awakening the rcu_preempt task when needed:
> 
> 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 
> 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> 		/* ^^^ Released rnp->lock */
> 		if (needwake)
> 			rcu_gp_kthread_wake(rsp);
> 
> o	rcu_prepare_for_idle(), which does the following, thus properly
> 	awakening the rcu_preempt task when needed:
> 
> 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> 		if (needwake)
> 			rcu_gp_kthread_wake(rsp);
> 
> Now for rcu_advance_cbs():
> 
> o	__note_gp_changes(), which which also returns the value returned
> 	by rcu_advance_cbs(), thus pushing the problem off to its callers,
> 	which are called out below.
> 
> o	rcu_migrate_callbacks(), which does the following, thus properly
> 	awakening the rcu_preempt task when needed:
> 
> 	needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> 		   rcu_advance_cbs(rsp, rnp_root, my_rdp);
> 	rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> 	WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> 		     !rcu_segcblist_n_cbs(&my_rdp->cblist));
> 	raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
> 
> Now for __note_gp_changes():
> 
> o	note_gp_changes(), which does the following, thus properly
> 	awakening the rcu_preempt task when needed:
> 
> 	needwake = __note_gp_changes(rsp, rnp, rdp);
> 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
> 
> o	rcu_gp_init() which is only ever invoked by RCU grace-period
> 	kthreads such as the rcu_preempt task, which makes wakeups
> 	unnecessary, just as for rcu_gp_cleanup() above.
> 
> o	rcu_gp_cleanup(), ditto.
> 
> So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis.  One way to do that is using event tracing.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> lockdep complaint:
> ------------------------------------------------------------------------
> 
> [    2.895507] ======================================================
> [    2.895511] WARNING: possible circular locking dependency detected
> [    2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G     U           
> [    2.895521] ------------------------------------------------------
> [    2.895525] earlyEvs/1839 is trying to acquire lock:
> [    2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895546] 
> [    2.895546] but task is already holding lock:
> [    2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [    2.895561] 
> [    2.895561] which lock already depends on the new lock.
> [    2.895561] 
> [    2.895566] 
> [    2.895566] the existing dependency chain (in reverse order) is:
> [    2.895570] 
> [    2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> [    2.895583]        __mutex_lock+0x80/0x9a0
> [    2.895588]        mutex_lock_nested+0x1b/0x20
> [    2.895593]        media_device_register_entity+0x92/0x1e0
> [    2.895598]        v4l2_device_register_subdev+0xc2/0x1b0
> [    2.895604]        ipu_isys_csi2_init+0x22c/0x520
> [    2.895608]        isys_probe+0x6cb/0xed0
> [    2.895613]        ipu_bus_probe+0xfd/0x2e0
> [    2.895620]        really_probe+0x268/0x3d0
> [    2.895625]        driver_probe_device+0x11a/0x130
> [    2.895630]        __device_attach_driver+0x86/0x100
> [    2.895635]        bus_for_each_drv+0x6e/0xb0
> [    2.895640]        __device_attach+0xdf/0x160
> [    2.895645]        device_initial_probe+0x13/0x20
> [    2.895650]        bus_probe_device+0xa6/0xc0
> [    2.895655]        deferred_probe_work_func+0x88/0xe0
> [    2.895661]        process_one_work+0x220/0x5c0
> [    2.895665]        worker_thread+0x1da/0x3b0
> [    2.895670]        kthread+0x12c/0x150
> [    2.895675]        ret_from_fork+0x3a/0x50
> [    2.895678] 
> [    2.895678] -> #0 (&asd->mutex){+.+.}:
> [    2.895688]        lock_acquire+0x95/0x1a0
> [    2.895693]        __mutex_lock+0x80/0x9a0
> [    2.895698]        mutex_lock_nested+0x1b/0x20
> [    2.895703]        ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895708]        ipu_isys_csi2_get_fmt+0x14/0x30
> [    2.895713]        v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [    2.895718]        v4l2_subdev_link_validate_one+0x67/0x120
> [    2.895723]        v4l2_subdev_link_validate+0x246/0x490
> [    2.895728]        csi2_link_validate+0xc6/0x220
> [    2.895733]        __media_pipeline_start+0x15b/0x2f0
> [    2.895738]        media_pipeline_start+0x33/0x50
> [    2.895743]        ipu_isys_video_prepare_streaming+0x1e0/0x610
> [    2.895748]        start_streaming+0x186/0x3a0
> [    2.895753]        vb2_start_streaming+0x6d/0x130
> [    2.895758]        vb2_core_streamon+0x108/0x140
> [    2.895762]        vb2_streamon+0x29/0x50
> [    2.895767]        vb2_ioctl_streamon+0x42/0x50
> [    2.895772]        v4l_streamon+0x20/0x30
> [    2.895776]        __video_do_ioctl+0x1af/0x3c0
> [    2.895781]        video_usercopy+0x27e/0x7e0
> [    2.895785]        video_ioctl2+0x15/0x20
> [    2.895789]        v4l2_ioctl+0x49/0x50
> [    2.895794]        do_video_ioctl+0x93c/0x2360
> [    2.895799]        v4l2_compat_ioctl32+0x93/0xe0
> [    2.895806]        __ia32_compat_sys_ioctl+0x73a/0x1c90
> [    2.895813]        do_fast_syscall_32+0x9a/0x2d6
> [    2.895818]        entry_SYSENTER_compat+0x6d/0x7c
> [    2.895821] 
> [    2.895821] other info that might help us debug this:
> [    2.895821] 
> [    2.895826]  Possible unsafe locking scenario:
> [    2.895826] 
> [    2.895830]        CPU0                    CPU1
> [    2.895833]        ----                    ----
> [    2.895836]   lock(&mdev->graph_mutex);
> [    2.895842]                                lock(&asd->mutex);
> [    2.895847]                                lock(&mdev->graph_mutex);
> [    2.895852]   lock(&asd->mutex);
> [    2.895857] 
> [    2.895857]  *** DEADLOCK ***
> [    2.895857] 
> [    2.895863] 3 locks held by earlyEvs/1839:
> [    2.895866]  #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> [    2.895876]  #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> [    2.895886]  #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [    2.895896] 
> [    2.895896] stack backtrace:
> [    2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G     U            4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> [    2.895907] Call Trace:
> [    2.895915]  dump_stack+0x70/0xa5
> [    2.895921]  print_circular_bug.isra.35+0x1d8/0x1e6
> [    2.895927]  __lock_acquire+0x1284/0x1340
> [    2.895931]  ? __lock_acquire+0x2b5/0x1340
> [    2.895940]  lock_acquire+0x95/0x1a0
> [    2.895945]  ? lock_acquire+0x95/0x1a0
> [    2.895950]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895956]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895961]  __mutex_lock+0x80/0x9a0
> [    2.895966]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895971]  ? crlmodule_get_format+0x43/0x50
> [    2.895979]  mutex_lock_nested+0x1b/0x20
> [    2.895984]  ? mutex_lock_nested+0x1b/0x20
> [    2.895989]  ipu_isys_subdev_get_ffmt+0x32/0x90
> [    2.895995]  ipu_isys_csi2_get_fmt+0x14/0x30
> [    2.896001]  v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [    2.896006]  v4l2_subdev_link_validate_one+0x67/0x120
> [    2.896011]  ? crlmodule_get_format+0x2a/0x50
> [    2.896018]  ? find_held_lock+0x35/0xa0
> [    2.896023]  ? crlmodule_get_format+0x43/0x50
> [    2.896030]  v4l2_subdev_link_validate+0x246/0x490
> [    2.896035]  ? __mutex_unlock_slowpath+0x58/0x2f0
> [    2.896042]  ? mutex_unlock+0x12/0x20
> [    2.896046]  ? crlmodule_get_format+0x43/0x50
> [    2.896052]  ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [    2.896057]  ? v4l2_subdev_link_validate_one+0x67/0x120
> [    2.896065]  ? __is_insn_slot_addr+0xad/0x120
> [    2.896070]  ? kernel_text_address+0xc4/0x100
> [    2.896078]  ? v4l2_subdev_link_validate+0x246/0x490
> [    2.896085]  ? kernel_text_address+0xc4/0x100
> [    2.896092]  ? __lock_acquire+0x1106/0x1340
> [    2.896096]  ? __lock_acquire+0x1169/0x1340
> [    2.896103]  csi2_link_validate+0xc6/0x220
> [    2.896110]  ? __lock_is_held+0x5a/0xa0
> [    2.896115]  ? mark_held_locks+0x58/0x80
> [    2.896122]  ? __kmalloc+0x207/0x2e0
> [    2.896127]  ? __lock_is_held+0x5a/0xa0
> [    2.896134]  ? rcu_read_lock_sched_held+0x81/0x90
> [    2.896139]  ? __kmalloc+0x2a3/0x2e0
> [    2.896144]  ? media_pipeline_start+0x28/0x50
> [    2.896150]  ? __media_entity_enum_init+0x33/0x70
> [    2.896155]  ? csi2_has_route+0x18/0x20
> [    2.896160]  ? media_graph_walk_next.part.9+0xac/0x290
> [    2.896166]  __media_pipeline_start+0x15b/0x2f0
> [    2.896173]  ? rcu_read_lock_sched_held+0x81/0x90
> [    2.896179]  media_pipeline_start+0x33/0x50
> [    2.896186]  ipu_isys_video_prepare_streaming+0x1e0/0x610
> [    2.896191]  ? __lock_acquire+0x132e/0x1340
> [    2.896198]  ? __lock_acquire+0x2b5/0x1340
> [    2.896204]  ? lock_acquire+0x95/0x1a0
> [    2.896209]  ? start_streaming+0x5c/0x3a0
> [    2.896215]  ? start_streaming+0x5c/0x3a0
> [    2.896221]  ? __mutex_lock+0x391/0x9a0
> [    2.896226]  ? v4l_enable_media_source+0x2d/0x70
> [    2.896233]  ? find_held_lock+0x35/0xa0
> [    2.896238]  ? v4l_enable_media_source+0x57/0x70
> [    2.896245]  start_streaming+0x186/0x3a0
> [    2.896250]  ? __mutex_unlock_slowpath+0x58/0x2f0
> [    2.896257]  vb2_start_streaming+0x6d/0x130
> [    2.896262]  ? vb2_start_streaming+0x6d/0x130
> [    2.896267]  vb2_core_streamon+0x108/0x140
> [    2.896273]  vb2_streamon+0x29/0x50
> [    2.896278]  vb2_ioctl_streamon+0x42/0x50
> [    2.896284]  v4l_streamon+0x20/0x30
> [    2.896288]  __video_do_ioctl+0x1af/0x3c0
> [    2.896296]  ? __might_fault+0x85/0x90
> [    2.896302]  video_usercopy+0x27e/0x7e0
> [    2.896307]  ? copy_overflow+0x20/0x20
> [    2.896313]  ? find_held_lock+0x35/0xa0
> [    2.896319]  ? __might_fault+0x3e/0x90
> [    2.896325]  video_ioctl2+0x15/0x20
> [    2.896330]  v4l2_ioctl+0x49/0x50
> [    2.896335]  do_video_ioctl+0x93c/0x2360
> [    2.896343]  v4l2_compat_ioctl32+0x93/0xe0
> [    2.896349]  __ia32_compat_sys_ioctl+0x73a/0x1c90
> [    2.896354]  ? lockdep_hardirqs_on+0xef/0x180
> [    2.896359]  ? do_fast_syscall_32+0x3b/0x2d6
> [    2.896364]  do_fast_syscall_32+0x9a/0x2d6
> [    2.896370]  entry_SYSENTER_compat+0x6d/0x7c
> [    2.896377] RIP: 0023:0xf7e79b79
> [    2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> [    2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> [    2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> [    2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> [    2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> [    2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [    2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> ------------------------------------------------------------------------
> 
> > [17818.936039] rcu:     rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 -  >gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0                                                                                                                             
> > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 -            >gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> > 
> > -----Original Message-----
> > From: Paul E. McKenney <paulmck@linux.ibm.com>
> > Sent: Thursday, December 13, 2018 12:40 PM
> > To: Zhang, Jun <jun.zhang@intel.com>
> > Cc: He, Bo <bo.he@intel.com>; Steven Rostedt <rostedt@goodmis.org>; 
> > linux-kernel@vger.kernel.org; josh@joshtriplett.org; 
> > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; Xiao, Jin 
> > <jin.xiao@intel.com>; Zhang, Yanmin <yanmin.zhang@intel.com>; Bai, Jie 
> > A <jie.a.bai@intel.com>; Sun, Yi J <yi.j.sun@intel.com>
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > Ok, we will test it, thanks!
> > 
> > But please also try the sysrq-y with the earlier patch after a hang!
> > 
> > 							Thanx, Paul
> > 
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:paulmck@linux.ibm.com]
> > > Sent: Thursday, December 13, 2018 10:43
> > > To: Zhang, Jun <jun.zhang@intel.com>
> > > Cc: He, Bo <bo.he@intel.com>; Steven Rostedt <rostedt@goodmis.org>; 
> > > linux-kernel@vger.kernel.org; josh@joshtriplett.org; 
> > > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; Xiao, Jin 
> > > <jin.xiao@intel.com>; Zhang, Yanmin <yanmin.zhang@intel.com>; Bai, 
> > > Jie A <jie.a.bai@intel.com>; Sun, Yi J <yi.j.sun@intel.com>
> > > Subject: Re: rcu_preempt caused oom
> > > 
> > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > Hello, Paul
> > > > 
> > > > I think the next patch is better.
> > > > Because ULONG_CMP_GE could cause double write, which has risk that write back old value.
> > > > Please help review.
> > > > I don't test it. If you agree, we will test it.
> > > 
> > > Just to make sure that I understand, you are worried about something like the following, correct?
> > > 
> > > o	__note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > 	and finds them equal.
> > > 
> > > o	At just this time something like rcu_start_this_gp() assigns a new
> > > 	(larger) value to rdp->gp_seq_needed.
> > > 
> > > o	Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > 	old value.
> > > 
> > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer.  So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > > 
> > > Of course, there could be bugs.  For example:
> > > 
> > > o	__note_gp_changes() might be called on a different CPU than that
> > > 	corresponding to rdp.  You can check this with something like:
> > > 
> > > 	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > > 
> > > o	The same things could happen with rcu_start_this_gp(), and the
> > > 	above WARN_ON_ONCE() would work there as well.
> > > 
> > > o	rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > 	you are doing CPU-hotplug operations.  (It can run on a CPU other
> > > 	than rdp->cpu, but only at times when rdp->cpu is offline.)
> > > 
> > > o	Interrupts might not really be disabled.
> > > 
> > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time.  So it might be worth testing just for that reason.
> > > 
> > > So why not just test it anyway?  If it makes the bug go away, I will 
> > > be surprised, but it would not be the first surprise for me.  ;-)
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Thanks!
> > > > 
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 
> > > > 0b760c1..c00f34e 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > >                 zero_cpu_stall_ticks(rdp);
> > > >         }
> > > >         rdp->gp_seq = rnp->gp_seq;  /* Remember new grace-period state. */
> > > > -       if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > +       if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) 
> > > > + ||
> > > > + rdp->gpwrap)
> > > >                 rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > >         WRITE_ONCE(rdp->gpwrap, false);
> > > >         rcu_gpnum_ovf(rnp, rdp);
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Paul E. McKenney [mailto:paulmck@linux.ibm.com]
> > > > Sent: Thursday, December 13, 2018 08:12
> > > > To: He, Bo <bo.he@intel.com>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>; 
> > > > linux-kernel@vger.kernel.org; josh@joshtriplett.org; 
> > > > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; Zhang, Jun 
> > > > <jun.zhang@intel.com>; Xiao, Jin <jin.xiao@intel.com>; Zhang, 
> > > > Yanmin <yanmin.zhang@intel.com>; Bai, Jie A <jie.a.bai@intel.com>; 
> > > > Sun, Yi J <yi.j.sun@intel.com>
> > > > Subject: Re: rcu_preempt caused oom
> > > > 
> > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > I don't see the rcutree.sysrq_rcu parameter in v4.19 kernel, I also checked the latest kernel and the latest tag v4.20-rc6, not see the sysrq_rcu.
> > > > > Please correct me if I have something wrong.
> > > > 
> > > > That would be because I sent you the wrong patch, apologies!  :-/
> > > > 
> > > > Please instead see the one below, which does add sysrq_rcu.
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <paulmck@linux.ibm.com>
> > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > To: He, Bo <bo.he@intel.com>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>; 
> > > > > linux-kernel@vger.kernel.org; josh@joshtriplett.org; 
> > > > > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; Zhang, 
> > > > > Jun <jun.zhang@intel.com>; Xiao, Jin <jin.xiao@intel.com>; 
> > > > > Zhang, Yanmin <yanmin.zhang@intel.com>; Bai, Jie A 
> > > > > <jie.a.bai@intel.com>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > > 
> > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > we reproduce on two boards, but I still not see the show_rcu_gp_kthreads() dump logs, it seems the patch can't catch the scenario.
> > > > > > > I double confirmed the CONFIG_PROVE_RCU=y is enabled in the config as it's extracted from the /proc/config.gz.
> > > > > > 
> > > > > > Strange.
> > > > > > 
> > > > > > Are the systems responsive to sysrq keys once failure occurs?  
> > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > > 
> > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below.  Only lightly tested.
> > > > 
> > > > ------------------------------------------------------------------
> > > > --
> > > > --
> > > > --
> > > > 
> > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > Author: Paul E. McKenney <paulmck@linux.ibm.com>
> > > > Date:   Wed Dec 12 16:10:09 2018 -0800
> > > > 
> > > >     rcu: Add sysrq rcu_node-dump capability
> > > >     
> > > >     Backported from v4.21/v5.0
> > > >     
> > > >     Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > >     stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > >     for failing to start a grace period.  This commit therefore adds a
> > > >     boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > >     dumping Tree RCU state.  The new rcutree.sysrq_rcu kernel boot parameter
> > > >     must be set for this sysrq to be available.
> > > >     
> > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> > > > 0b760c1369f7..e9392a9d6291 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -61,6 +61,7 @@
> > > >  #include <linux/trace_events.h>
> > > >  #include <linux/suspend.h>
> > > >  #include <linux/ftrace.h>
> > > > +#include <linux/sysrq.h>
> > > >  
> > > >  #include "tree.h"
> > > >  #include "rcu.h"
> > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;  int 
> > > > rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes 
> > > > in use. */
> > > >  /* panic() on RCU Stall sysctl. */  int sysctl_panic_on_rcu_stall 
> > > > __read_mostly;
> > > > +/* Commandeer a sysrq key to dump RCU's tree. */ static bool 
> > > > +sysrq_rcu; module_param(sysrq_rcu, bool, 0444);
> > > >  
> > > >  /*
> > > >   * The rcu_scheduler_active variable is initialized to the value 
> > > > @@
> > > > -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)  } 
> > > > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > >  
> > > > +/* Dump grace-period-request information due to commandeered sysrq. 
> > > > +*/ static void sysrq_show_rcu(int key) {
> > > > +	show_rcu_gp_kthreads();
> > > > +}
> > > > +
> > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > +	.handler = sysrq_show_rcu,
> > > > +	.help_msg = "show-rcu(y)",
> > > > +	.action_msg = "Show RCU tree",
> > > > +	.enable_mask = SYSRQ_ENABLE_DUMP, };
> > > > +
> > > > +static int __init rcu_sysrq_init(void) {
> > > > +	if (sysrq_rcu)
> > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > +	return 0;
> > > > +}
> > > > +early_initcall(rcu_sysrq_init);
> > > > +
> > > >  /*
> > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > >   */
> > > > 
> > > 
> > 
> 
>