2018-11-29 08:50:31

by He, Bo

Subject: rcu_preempt caused oom

Hi,
We are testing kernel 4.19.0 on Android. After running a monkey stress test for more than 24 hours, we see an OOM on 1 out of 10 boards with 2 GB of memory; the issue is not seen on the 4.14 kernel.
We have done some debugging:
1. The OOM is due to filp allocations consuming too much memory: roughly 300 MB on a 2 GB board.
2. With the 120s hung-task detector, most of the tasks block in __wait_rcu_gp at wait_for_completion(&rs_array[i].completion); a sketch of the underlying pattern follows the trace in point 3.
[47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
[47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O 4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
[47571.887603] Call Trace:
[47571.890547] dump_stack+0x70/0xa5
[47571.894456] panic+0xe3/0x241
[47571.897977] ? wait_for_completion_timeout+0x72/0x1b0
[47571.903830] __wait_rcu_gp+0x17b/0x180
[47571.908226] synchronize_rcu.part.76+0x38/0x50
[47571.913393] ? __call_rcu.constprop.79+0x3a0/0x3a0
[47571.918948] ? __bpf_trace_rcu_invoke_callback+0x10/0x10
[47571.925094] synchronize_rcu+0x43/0x50
[47571.929487] evdev_detach_client+0x59/0x60
[47571.934264] evdev_release+0x4e/0xd0
[47571.938464] __fput+0xfa/0x1f0
[47571.942072] ____fput+0xe/0x10
[47571.945683] task_work_run+0x90/0xc0
[47571.949884] exit_to_usermode_loop+0x9f/0xb0
[47571.954855] do_syscall_64+0xfa/0x110
[47571.959151] entry_SYSCALL_64_after_hwframe+0x49/0xbe
3. After enabling the RCU trace events, we don't see any rcu_quiescent_state_report trace for a long time; we see rcu_callback events for rcu_preempt that are never followed by a matching rcu_invoke_callback.
[47572.040668] ps-12388 1d..1 47566097572us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
[47572.040707] ps-12388 1d... 47566097621us : rcu_callback: rcu_preempt rhp=00000000783a728b func=file_free_rcu 4354/82824
[47572.040734] ps-12388 1d..1 47566097622us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
[47572.040756] ps-12388 1d..1 47566097623us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
[47572.040778] ps-12388 1d..1 47566097623us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
[47572.040802] ps-12388 1d... 47566097674us : rcu_callback: rcu_preempt rhp=0000000042c76521 func=file_free_rcu 4354/82825
[47572.040824] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
[47572.040847] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
[47572.040868] ps-12388 1d..1 47566097676us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
[47572.040895] ps-12388 1d..1 47566097716us : rcu_callback: rcu_preempt rhp=000000005e40fde2 func=avc_node_free 4354/82826
[47572.040919] ps-12388 1d..1 47566097735us : rcu_callback: rcu_preempt rhp=00000000f80fe353 func=avc_node_free 4354/82827
[47572.040943] ps-12388 1d..1 47566097758us : rcu_callback: rcu_preempt rhp=000000007486f400 func=avc_node_free 4354/82828
[47572.040967] ps-12388 1d..1 47566097760us : rcu_callback: rcu_preempt rhp=00000000b87872a8 func=avc_node_free 4354/82829
[47572.040990] ps-12388 1d... 47566097789us : rcu_callback: rcu_preempt rhp=000000008c656343 func=file_free_rcu 4354/82830
[47572.041013] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
[47572.041036] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
[47572.041057] ps-12388 1d..1 47566097791us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
[47572.041081] ps-12388 1d... 47566097871us : rcu_callback: rcu_preempt rhp=000000007e6c898c func=file_free_rcu 4354/82831
[47572.041103] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
[47572.041126] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
[47572.041147] ps-12388 1d..1 47566097873us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
[47572.041170] ps-12388 1d... 47566097945us : rcu_callback: rcu_preempt rhp=0000000032f4f174 func=file_free_rcu 4354/82832
[47572.041193] ps-12388 1d..1 47566097946us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
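
To make points 1 and 2 concrete: the filp growth and the blocked tasks are two sides of the same deferred-free pattern. Here is a rough sketch of the two paths involved; this is illustrative only, not the exact VFS/evdev code, and struct file_obj and the function names are made up:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Illustrative sketch, not the exact kernel code.  Path 1: the final
 * fput() defers freeing the file object through call_rcu().  If grace
 * periods stop completing, these file_free_rcu-style callbacks pile up
 * and the filp slab grows without bound -- matching the ~300 MB seen
 * at OOM time. */
struct file_obj {
	struct rcu_head f_rcuhead;
	/* ... */
};

static void file_free_cb(struct rcu_head *head)
{
	kfree(container_of(head, struct file_obj, f_rcuhead));
}

static void file_free(struct file_obj *f)
{
	call_rcu(&f->f_rcuhead, file_free_cb); /* freed after a grace period */
}

/* Path 2: an evdev-style detach waits synchronously for a grace period,
 * so every close() of the device blocks in __wait_rcu_gp() when the
 * grace period stalls -- matching the hung tasks above. */
static DEFINE_SPINLOCK(client_lock);

static void detach_client(struct list_head *client)
{
	spin_lock(&client_lock);
	list_del_rcu(client);	/* unlink from the reader-visible list */
	spin_unlock(&client_lock);
	synchronize_rcu();	/* wait for pre-existing readers to finish */
}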

Do you have any suggestions to debug the issue?


2018-11-29 13:09:13

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Thu, Nov 29, 2018 at 08:49:35AM +0000, He, Bo wrote:
> Hi,
> We are testing kernel 4.19.0 on Android. After running a monkey stress test for more than 24 hours, we see an OOM on 1 out of 10 boards with 2 GB of memory; the issue is not seen on the 4.14 kernel.
> We have done some debugging:
> 1. The OOM is due to filp allocations consuming too much memory: roughly 300 MB on a 2 GB board.
> 2. With the 120s hung-task detector, most of the tasks block in __wait_rcu_gp at wait_for_completion(&rs_array[i].completion);
> [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O 4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> [47571.887603] Call Trace:
> [47571.890547] dump_stack+0x70/0xa5
> [47571.894456] panic+0xe3/0x241
> [47571.897977] ? wait_for_completion_timeout+0x72/0x1b0
> [47571.903830] __wait_rcu_gp+0x17b/0x180
> [47571.908226] synchronize_rcu.part.76+0x38/0x50
> [47571.913393] ? __call_rcu.constprop.79+0x3a0/0x3a0
> [47571.918948] ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> [47571.925094] synchronize_rcu+0x43/0x50
> [47571.929487] evdev_detach_client+0x59/0x60
> [47571.934264] evdev_release+0x4e/0xd0
> [47571.938464] __fput+0xfa/0x1f0
> [47571.942072] ____fput+0xe/0x10
> [47571.945683] task_work_run+0x90/0xc0
> [47571.949884] exit_to_usermode_loop+0x9f/0xb0
> [47571.954855] do_syscall_64+0xfa/0x110
> [47571.959151] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 3. After enabling the RCU trace events, we don't see any rcu_quiescent_state_report trace for a long time; we see rcu_callback events for rcu_preempt that are never followed by a matching rcu_invoke_callback.
> [47572.040668] ps-12388 1d..1 47566097572us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> [47572.040707] ps-12388 1d... 47566097621us : rcu_callback: rcu_preempt rhp=00000000783a728b func=file_free_rcu 4354/82824
> [47572.040734] ps-12388 1d..1 47566097622us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.040756] ps-12388 1d..1 47566097623us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.040778] ps-12388 1d..1 47566097623us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> [47572.040802] ps-12388 1d... 47566097674us : rcu_callback: rcu_preempt rhp=0000000042c76521 func=file_free_rcu 4354/82825
> [47572.040824] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.040847] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.040868] ps-12388 1d..1 47566097676us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> [47572.040895] ps-12388 1d..1 47566097716us : rcu_callback: rcu_preempt rhp=000000005e40fde2 func=avc_node_free 4354/82826
> [47572.040919] ps-12388 1d..1 47566097735us : rcu_callback: rcu_preempt rhp=00000000f80fe353 func=avc_node_free 4354/82827
> [47572.040943] ps-12388 1d..1 47566097758us : rcu_callback: rcu_preempt rhp=000000007486f400 func=avc_node_free 4354/82828
> [47572.040967] ps-12388 1d..1 47566097760us : rcu_callback: rcu_preempt rhp=00000000b87872a8 func=avc_node_free 4354/82829
> [47572.040990] ps-12388 1d... 47566097789us : rcu_callback: rcu_preempt rhp=000000008c656343 func=file_free_rcu 4354/82830
> [47572.041013] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.041036] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.041057] ps-12388 1d..1 47566097791us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> [47572.041081] ps-12388 1d... 47566097871us : rcu_callback: rcu_preempt rhp=000000007e6c898c func=file_free_rcu 4354/82831
> [47572.041103] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.041126] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.041147] ps-12388 1d..1 47566097873us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> [47572.041170] ps-12388 1d... 47566097945us : rcu_callback: rcu_preempt rhp=0000000032f4f174 func=file_free_rcu 4354/82832
> [47572.041193] ps-12388 1d..1 47566097946us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
>
> Do you have any suggestions to debug the issue?

If you do not already have CONFIG_RCU_BOOST=y set, could you please
rebuild with that?

Could you also please send your .config file?

Thanx, Paul


2018-11-29 14:28:13

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Thu, Nov 29, 2018 at 05:06:47AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 29, 2018 at 08:49:35AM +0000, He, Bo wrote:
> > Hi,
> > We are testing kernel 4.19.0 on Android. After running a monkey stress test for more than 24 hours, we see an OOM on 1 out of 10 boards with 2 GB of memory; the issue is not seen on the 4.14 kernel.
> > We have done some debugging:
> > 1. The OOM is due to filp allocations consuming too much memory: roughly 300 MB on a 2 GB board.
> > 2. With the 120s hung-task detector, most of the tasks block in __wait_rcu_gp at wait_for_completion(&rs_array[i].completion);

Did you see any RCU CPU stall warnings? Or have those been disabled?
If they have been disabled, could you please rerun with them enabled?

> > [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> > [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O 4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> > [47571.887603] Call Trace:
> > [47571.890547] dump_stack+0x70/0xa5
> > [47571.894456] panic+0xe3/0x241
> > [47571.897977] ? wait_for_completion_timeout+0x72/0x1b0
> > [47571.903830] __wait_rcu_gp+0x17b/0x180
> > [47571.908226] synchronize_rcu.part.76+0x38/0x50
> > [47571.913393] ? __call_rcu.constprop.79+0x3a0/0x3a0
> > [47571.918948] ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> > [47571.925094] synchronize_rcu+0x43/0x50
> > [47571.929487] evdev_detach_client+0x59/0x60
> > [47571.934264] evdev_release+0x4e/0xd0
> > [47571.938464] __fput+0xfa/0x1f0
> > [47571.942072] ____fput+0xe/0x10
> > [47571.945683] task_work_run+0x90/0xc0
> > [47571.949884] exit_to_usermode_loop+0x9f/0xb0
> > [47571.954855] do_syscall_64+0xfa/0x110
> > [47571.959151] entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is indeed a task waiting on synchronize_rcu().

> > 3. After enabling the RCU trace events, we don't see any rcu_quiescent_state_report trace for a long time; we see rcu_callback events for rcu_preempt that are never followed by a matching rcu_invoke_callback.
> > [47572.040668] ps-12388 1d..1 47566097572us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040707] ps-12388 1d... 47566097621us : rcu_callback: rcu_preempt rhp=00000000783a728b func=file_free_rcu 4354/82824
> > [47572.040734] ps-12388 1d..1 47566097622us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040756] ps-12388 1d..1 47566097623us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040778] ps-12388 1d..1 47566097623us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040802] ps-12388 1d... 47566097674us : rcu_callback: rcu_preempt rhp=0000000042c76521 func=file_free_rcu 4354/82825
> > [47572.040824] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040847] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040868] ps-12388 1d..1 47566097676us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040895] ps-12388 1d..1 47566097716us : rcu_callback: rcu_preempt rhp=000000005e40fde2 func=avc_node_free 4354/82826
> > [47572.040919] ps-12388 1d..1 47566097735us : rcu_callback: rcu_preempt rhp=00000000f80fe353 func=avc_node_free 4354/82827
> > [47572.040943] ps-12388 1d..1 47566097758us : rcu_callback: rcu_preempt rhp=000000007486f400 func=avc_node_free 4354/82828
> > [47572.040967] ps-12388 1d..1 47566097760us : rcu_callback: rcu_preempt rhp=00000000b87872a8 func=avc_node_free 4354/82829
> > [47572.040990] ps-12388 1d... 47566097789us : rcu_callback: rcu_preempt rhp=000000008c656343 func=file_free_rcu 4354/82830
> > [47572.041013] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041036] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.041057] ps-12388 1d..1 47566097791us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.041081] ps-12388 1d... 47566097871us : rcu_callback: rcu_preempt rhp=000000007e6c898c func=file_free_rcu 4354/82831
> > [47572.041103] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041126] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.041147] ps-12388 1d..1 47566097873us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.041170] ps-12388 1d... 47566097945us : rcu_callback: rcu_preempt rhp=0000000032f4f174 func=file_free_rcu 4354/82832
> > [47572.041193] ps-12388 1d..1 47566097946us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf

Callbacks are being queued and future grace periods to handle them are being
requested, but as you say, no progress on the current grace period.

Is it possible to start the trace earlier?

> > Do you have any suggestions to debug the issue?
>
> If you do not already have CONFIG_RCU_BOOST=y set, could you please
> rebuild with that?
>
> Could you also please send your .config file?

So, to summarize:

1. If you don't have RCU CPU stall warnings enabled,
please enable them. For example, please remove
rcupdate.rcu_cpu_stall_suppress from the kernel boot
parameters if it is there.

Getting an RCU CPU stall warning would be extremely
helpful. It contains many useful diagnostics.

2. If possible, please start the trace before the last
grace period starts.

3. If CONFIG_RCU_BOOST=y is not set, please try setting it.

4. Please send me your .config file.

Thanx, Paul


2018-11-30 08:06:25

by He, Bo

Subject: RE: rcu_preempt caused oom

Thanks for your great suggestions.
After enabling CONFIG_RCU_BOOST=y, we have not reproduced the issue so far; we will keep the test running and update you with the results.

The enclosed is the kernel config; below are the RCU-related options grepped from it. We don't enable CONFIG_RCU_BOOST in our build.
# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# RCU Debugging
CONFIG_RCU_PERF_TEST=m
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y


-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Thursday, November 29, 2018 10:27 PM
To: He, Bo <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
Subject: Re: rcu_preempt caused oom

On Thu, Nov 29, 2018 at 05:06:47AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 29, 2018 at 08:49:35AM +0000, He, Bo wrote:
> > Hi,
> > We are testing kernel 4.19.0 on Android. After running a monkey stress test for more than 24 hours, we see an OOM on 1 out of 10 boards with 2 GB of memory; the issue is not seen on the 4.14 kernel.
> > We have done some debugging:
> > 1. The OOM is due to filp allocations consuming too much memory: roughly 300 MB on a 2 GB board.
> > 2. With the 120s hung-task detector, most of the tasks block in __wait_rcu_gp at wait_for_completion(&rs_array[i].completion);

Did you see any RCU CPU stall warnings? Or have those been disabled?
If they have been disabled, could you please rerun with them enabled?

> > [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> > [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O 4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> > [47571.887603] Call Trace:
> > [47571.890547] dump_stack+0x70/0xa5
> > [47571.894456] panic+0xe3/0x241
> > [47571.897977] ? wait_for_completion_timeout+0x72/0x1b0
> > [47571.903830] __wait_rcu_gp+0x17b/0x180
> > [47571.908226] synchronize_rcu.part.76+0x38/0x50
> > [47571.913393] ? __call_rcu.constprop.79+0x3a0/0x3a0
> > [47571.918948] ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> > [47571.925094] synchronize_rcu+0x43/0x50
> > [47571.929487] evdev_detach_client+0x59/0x60
> > [47571.934264] evdev_release+0x4e/0xd0
> > [47571.938464] __fput+0xfa/0x1f0
> > [47571.942072] ____fput+0xe/0x10
> > [47571.945683] task_work_run+0x90/0xc0
> > [47571.949884] exit_to_usermode_loop+0x9f/0xb0
> > [47571.954855] do_syscall_64+0xfa/0x110
> > [47571.959151] entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is indeed a task waiting on synchronize_rcu().

> > 3. After enabling the RCU trace events, we don't see any rcu_quiescent_state_report trace for a long time; we see rcu_callback events for rcu_preempt that are never followed by a matching rcu_invoke_callback.
> > [47572.040668] ps-12388 1d..1 47566097572us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040707] ps-12388 1d... 47566097621us : rcu_callback: rcu_preempt rhp=00000000783a728b func=file_free_rcu 4354/82824
> > [47572.040734] ps-12388 1d..1 47566097622us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040756] ps-12388 1d..1 47566097623us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040778] ps-12388 1d..1 47566097623us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040802] ps-12388 1d... 47566097674us : rcu_callback: rcu_preempt rhp=0000000042c76521 func=file_free_rcu 4354/82825
> > [47572.040824] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040847] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040868] ps-12388 1d..1 47566097676us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.040895] ps-12388 1d..1 47566097716us : rcu_callback: rcu_preempt rhp=000000005e40fde2 func=avc_node_free 4354/82826
> > [47572.040919] ps-12388 1d..1 47566097735us : rcu_callback: rcu_preempt rhp=00000000f80fe353 func=avc_node_free 4354/82827
> > [47572.040943] ps-12388 1d..1 47566097758us : rcu_callback: rcu_preempt rhp=000000007486f400 func=avc_node_free 4354/82828
> > [47572.040967] ps-12388 1d..1 47566097760us : rcu_callback: rcu_preempt rhp=00000000b87872a8 func=avc_node_free 4354/82829
> > [47572.040990] ps-12388 1d... 47566097789us : rcu_callback: rcu_preempt rhp=000000008c656343 func=file_free_rcu 4354/82830
> > [47572.041013] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041036] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.041057] ps-12388 1d..1 47566097791us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.041081] ps-12388 1d... 47566097871us : rcu_callback: rcu_preempt rhp=000000007e6c898c func=file_free_rcu 4354/82831
> > [47572.041103] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041126] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.041147] ps-12388 1d..1 47566097873us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > [47572.041170] ps-12388 1d... 47566097945us : rcu_callback: rcu_preempt rhp=0000000032f4f174 func=file_free_rcu 4354/82832
> > [47572.041193] ps-12388 1d..1 47566097946us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf

Callbacks are being queued and future grace periods to handle them are being requested, but as you say, no progress on the current grace period.

Is it possible to start the trace earlier?

> > Do you have any suggestions to debug the issue?
>
> If you do not already have CONFIG_RCU_BOOST=y set, could you please
> rebuild with that?
>
> Could you also please send your .config file?

So, to summarize:

1. If you don't have RCU CPU stall warnings enabled,
please enable them. For example, please remove
rcupdate.rcu_cpu_stall_suppress from the kernel boot
parameters if it is there.

Getting an RCU CPU stall warning would be extremely
helpful. It contains many useful diagnostics.

2. If possible, please start the trace before the last
grace period starts.

3. If CONFIG_RCU_BOOST=y is not set, please try setting it.

4. Please send me your .config file.

Thanx, Paul


Attachments:
kernel_config (165.00 kB)

2018-11-30 14:44:28

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Fri, Nov 30, 2018 at 08:03:38AM +0000, He, Bo wrote:
> Thanks for your great suggestions.
> After enabling CONFIG_RCU_BOOST=y, we have not reproduced the issue so far; we will keep the test running and update you with the results.
>
> The enclosed is the kernel config; below are the RCU-related options grepped from it. We don't enable CONFIG_RCU_BOOST in our build.
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # RCU Debugging
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y

Thank you!

What likely happened is that a low-priority RCU reader was preempted
indefinitely. Though I would have expected an RCU CPU stall warning
in that case, so it might well be that something else is going on.
Could you please send me your list of kernel boot parameters? They
usually appear near the start of your console output.
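
Schematically, the suspected scenario is an ordinary preemptible reader like the sketch below (illustrative only; struct item and global_list are made-up names). Under CONFIG_PREEMPT_RCU, a low-priority task preempted between rcu_read_lock() and rcu_read_unlock() holds up grace-period completion until it runs again; CONFIG_RCU_BOOST=y would let RCU priority-boost it through an rt_mutex so the grace period can end:

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>

/* Illustrative sketch; struct item and global_list are made-up names. */
struct item {
	struct list_head node;
	int data;
};

static LIST_HEAD(global_list);

static int reader(void)
{
	struct item *p;
	int sum = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(p, &global_list, node)
		sum += p->data;	/* preemption here can stall the grace period */
	rcu_read_unlock();	/* the reader is accounted done only here */
	return sum;
}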

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Thursday, November 29, 2018 10:27 PM
> To: He, Bo <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Nov 29, 2018 at 05:06:47AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 29, 2018 at 08:49:35AM +0000, He, Bo wrote:
> > > Hi,
> > > We are testing kernel 4.19.0 on Android. After running a monkey stress test for more than 24 hours, we see an OOM on 1 out of 10 boards with 2 GB of memory; the issue is not seen on the 4.14 kernel.
> > > We have done some debugging:
> > > 1. The OOM is due to filp allocations consuming too much memory: roughly 300 MB on a 2 GB board.
> > > 2. With the 120s hung-task detector, most of the tasks block in __wait_rcu_gp at wait_for_completion(&rs_array[i].completion);
>
> Did you see any RCU CPU stall warnings? Or have those been disabled?
> If they have been disabled, could you please rerun with them enabled?
>
> > > [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> > > [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O 4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> > > [47571.887603] Call Trace:
> > > [47571.890547] dump_stack+0x70/0xa5
> > > [47571.894456] panic+0xe3/0x241
> > > [47571.897977] ? wait_for_completion_timeout+0x72/0x1b0
> > > [47571.903830] __wait_rcu_gp+0x17b/0x180
> > > [47571.908226] synchronize_rcu.part.76+0x38/0x50
> > > [47571.913393] ? __call_rcu.constprop.79+0x3a0/0x3a0
> > > [47571.918948] ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> > > [47571.925094] synchronize_rcu+0x43/0x50
> > > [47571.929487] evdev_detach_client+0x59/0x60
> > > [47571.934264] evdev_release+0x4e/0xd0
> > > [47571.938464] __fput+0xfa/0x1f0
> > > [47571.942072] ____fput+0xe/0x10
> > > [47571.945683] task_work_run+0x90/0xc0
> > > [47571.949884] exit_to_usermode_loop+0x9f/0xb0
> > > [47571.954855] do_syscall_64+0xfa/0x110
> > > [47571.959151] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> This is indeed a task waiting on synchronize_rcu().
>
> > > 3. After enabling the RCU trace events, we don't see any rcu_quiescent_state_report trace for a long time; we see rcu_callback events for rcu_preempt that are never followed by a matching rcu_invoke_callback.
> > > [47572.040668] ps-12388 1d..1 47566097572us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > > [47572.040707] ps-12388 1d... 47566097621us : rcu_callback: rcu_preempt rhp=00000000783a728b func=file_free_rcu 4354/82824
> > > [47572.040734] ps-12388 1d..1 47566097622us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.040756] ps-12388 1d..1 47566097623us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > > [47572.040778] ps-12388 1d..1 47566097623us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > > [47572.040802] ps-12388 1d... 47566097674us : rcu_callback: rcu_preempt rhp=0000000042c76521 func=file_free_rcu 4354/82825
> > > [47572.040824] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.040847] ps-12388 1d..1 47566097676us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > > [47572.040868] ps-12388 1d..1 47566097676us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > > [47572.040895] ps-12388 1d..1 47566097716us : rcu_callback: rcu_preempt rhp=000000005e40fde2 func=avc_node_free 4354/82826
> > > [47572.040919] ps-12388 1d..1 47566097735us : rcu_callback: rcu_preempt rhp=00000000f80fe353 func=avc_node_free 4354/82827
> > > [47572.040943] ps-12388 1d..1 47566097758us : rcu_callback: rcu_preempt rhp=000000007486f400 func=avc_node_free 4354/82828
> > > [47572.040967] ps-12388 1d..1 47566097760us : rcu_callback: rcu_preempt rhp=00000000b87872a8 func=avc_node_free 4354/82829
> > > [47572.040990] ps-12388 1d... 47566097789us : rcu_callback: rcu_preempt rhp=000000008c656343 func=file_free_rcu 4354/82830
> > > [47572.041013] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.041036] ps-12388 1d..1 47566097790us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > > [47572.041057] ps-12388 1d..1 47566097791us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > > [47572.041081] ps-12388 1d... 47566097871us : rcu_callback: rcu_preempt rhp=000000007e6c898c func=file_free_rcu 4354/82831
> > > [47572.041103] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.041126] ps-12388 1d..1 47566097872us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > > [47572.041147] ps-12388 1d..1 47566097873us : rcu_grace_period: rcu_preempt 23716088 AccWaitCB
> > > [47572.041170] ps-12388 1d... 47566097945us : rcu_callback: rcu_preempt rhp=0000000032f4f174 func=file_free_rcu 4354/82832
> > > [47572.041193] ps-12388 1d..1 47566097946us : rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
>
> Callbacks are being queued and future grace periods to handle them are being requested, but as you say, no progress on the current grace period.
>
> Is it possible to start the trace earlier?
>
> > > Do you have any suggestions to debug the issue?
> >
> > If you do not already have CONFIG_RCU_BOOST=y set, could you please
> > rebuild with that?
> >
> > Could you also please send your .config file?
>
> So, to summarize:
>
> 1. If you don't have RCU CPU stall warnings enabled,
> please enable them. For example, please remove
> rcupdate.rcu_cpu_stall_suppress from the kernel boot
> parameters if it is there.
>
> Getting an RCU CPU stall warning would be extremely
> helpful. It contains many useful diagnostics.
>
> 2. If possible, please start the trace before the last
> grace period starts.
>
> 3. If CONFIG_RCU_BOOST=y is not set, please try setting it.
>
> 4. Please send me your .config file.
>
> Thanx, Paul
>



2018-11-30 15:18:33

by Steven Rostedt

Subject: Re: rcu_preempt caused oom

On Fri, 30 Nov 2018 06:43:17 -0800
"Paul E. McKenney" <[email protected]> wrote:

> Could you please send me your list of kernel boot parameters? They
> usually appear near the start of your console output.

Or just: cat /proc/cmdline

-- Steve

2018-11-30 15:20:54

by He, Bo

Subject: RE: rcu_preempt caused oom

Here is the kernel cmdline:

Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=

-----Original Message-----
From: Steven Rostedt <[email protected]>
Sent: Friday, November 30, 2018 11:17 PM
To: Paul E. McKenney <[email protected]>
Cc: He, Bo <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
Subject: Re: rcu_preempt caused oom

On Fri, 30 Nov 2018 06:43:17 -0800
"Paul E. McKenney" <[email protected]> wrote:

> Could you please send me your list of kernel boot parameters? They
> usually appear near the start of your console output.

Or just: cat /proc/cmdline

-- Steve

2018-11-30 16:50:20

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> Here is the kernel cmdline:

Thank you!

> Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=

And no sign of any suppression of RCU CPU stall warnings. Hmmm...
Does it take more than 21 seconds to OOM? Or do things happen faster
than that? If they do happen faster than that, then one approach would
be to add something like this to the kernel command line:

rcupdate.rcu_cpu_stall_timeout=7

This would set the stall timeout to seven seconds. Note that timeouts
less than three seconds are silently interpreted as three seconds.

Thanx, Paul

> -----Original Message-----
> From: Steven Rostedt <[email protected]>
> Sent: Friday, November 30, 2018 11:17 PM
> To: Paul E. McKenney <[email protected]>
> Cc: He, Bo <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Fri, 30 Nov 2018 06:43:17 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > Could you please send me your list of kernel boot parameters? They
> > usually appear near the start of your console output.
>
> Or just: cat /proc/cmdline
>
> -- Steve
>


2018-12-03 07:46:12

by He, Bo

Subject: RE: rcu_preempt caused oom

Thanks. We have run the test over the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.

We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall; we will see whether we can catch the panic and will keep you posted with the test results.
echo 1 > /proc/sys/kernel/panic_on_rcu_stall

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Saturday, December 1, 2018 12:49 AM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
Subject: Re: rcu_preempt caused oom

On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> Here is the kernel cmdline:

Thank you!

> Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=

And no sign of any suppression of RCU CPU stall warnings. Hmmm...
Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster than that, then one approach would be to add something like this to the kernel command line:

rcupdate.rcu_cpu_stall_timeout=7

This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.

Thanx, Paul

> -----Original Message-----
> From: Steven Rostedt <[email protected]>
> Sent: Friday, November 30, 2018 11:17 PM
> To: Paul E. McKenney <[email protected]>
> Cc: He, Bo <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin
> <[email protected]>; Zhang, Yanmin <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Fri, 30 Nov 2018 06:43:17 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > Could you please send me your list of kernel boot parameters? They
> > usually appear near the start of your console output.
>
> Or just: cat /proc/cmdline
>
> -- Steve
>


2018-12-03 13:58:48

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> Thanks. We have run the test over the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.

Very good, that is encouraging. Perhaps I should think about making
CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least
for architectures for which rt_mutexes are implemented.

> We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall; we will see whether we can catch the panic and will keep you posted with the test results.
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall

Looking forward to seeing what is going on! Of course, to reproduce, you
will need to again build with CONFIG_RCU_BOOST=n.

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Saturday, December 1, 2018 12:49 AM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > Here is the kernel cmdline:
>
> Thank you!
>
> > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
>
> And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster than that, then one approach would be to add something like this to the kernel command line:
>
> rcupdate.rcu_cpu_stall_timeout=7
>
> This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Steven Rostedt <[email protected]>
> > Sent: Friday, November 30, 2018 11:17 PM
> > To: Paul E. McKenney <[email protected]>
> > Cc: He, Bo <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, 30 Nov 2018 06:43:17 -0800
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > Could you please send me your list of kernel boot parameters? They
> > > usually appear near the start of your console output.
> >
> > Or just: cat /proc/cmdline
> >
> > -- Steve
> >
>


2018-12-04 07:52:14

by He, Bo

Subject: RE: rcu_preempt caused oom

Hi, Paul:
The enclosed log shows the 120s hung_task_panic triggering without any other debug patches. The hung task is blocked at __wait_rcu_gp, which means the RCU CPU stall detector cannot catch this scenario:
echo 1 > /proc/sys/kernel/panic_on_rcu_stall
echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout


-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Monday, December 3, 2018 9:57 PM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
Subject: Re: rcu_preempt caused oom

On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> Thanks. We have run the test over the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.

Very good, that is encouraging. Perhaps I should think about making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for architectures for which rt_mutexes are implemented.

> We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall; we will see whether we can catch the panic and will keep you posted with the test results.
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall

Looking forward to seeing what is going on! Of course, to reproduce, you will need to again build with CONFIG_RCU_BOOST=n.

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Saturday, December 1, 2018 12:49 AM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Zhang, Jun
> <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > Here is the kernel cmdline:
>
> Thank you!
>
> > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
>
> And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> > Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster than that, then one approach would be to add something like this to the kernel command line:
>
> rcupdate.rcu_cpu_stall_timeout=7
>
> This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Steven Rostedt <[email protected]>
> > Sent: Friday, November 30, 2018 11:17 PM
> > To: Paul E. McKenney <[email protected]>
> > Cc: He, Bo <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, 30 Nov 2018 06:43:17 -0800
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > Could you please send me your list of kernel boot parameters?
> > > They usually appear near the start of your console output.
> >
> > Or just: cat /proc/cmdline
> >
> > -- Steve
> >
>


Attachments:
apanic_console (26.67 kB)

2018-12-04 19:50:56

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> Hi, Paul:
> The enclosed log shows the 120s hung_task_panic triggering without any other debug patches. The hung task is blocked at __wait_rcu_gp, which means the RCU CPU stall detector cannot catch this scenario:
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

Not necessarily. If there is an RCU CPU stall warning, blocking within
__wait_rcu_gp() is expected behavior. It is possible that the problem is
that although the grace period is completing as required, the callbacks
are not being invoked in a timely fashion. And that could happen if you
had CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or, alternatively,
callback offloading enabled. But I don't see these in your previous
emails. Another possible cause is that the grace-period kthread is being
delayed, so that the grace period never starts. This seems unlikely,
but it is the only thing thus far that matches the symptoms.

CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to
be run at SCHED_FIFO priority 1, and that would help in the case where
RCU's grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh tasks,
all of which execute in the rcu_gp_kthread() function) was being starved
of CPU time.

Does that sound likely?

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Monday, December 3, 2018 9:57 PM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > Thanks. We have run the test over the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.
>
> Very good, that is encouraging. Perhaps I should think about making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for architectures for which rt_mutexes are implemented.
>
> > We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall; we will see whether we can catch the panic and will keep you posted with the test results.
> > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
>
> Looking forward to seeing what is going on! Of course, to reproduce, you will need to again build with CONFIG_RCU_BOOST=n.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Saturday, December 1, 2018 12:49 AM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > Here is the kernel cmdline:
> >
> > Thank you!
> >
> > > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
> >
> > And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> > Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster than that, then one approach would be to add something like this to the kernel command line:
> >
> > rcupdate.rcu_cpu_stall_timeout=7
> >
> > This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Steven Rostedt <[email protected]>
> > > Sent: Friday, November 30, 2018 11:17 PM
> > > To: Paul E. McKenney <[email protected]>
> > > Cc: He, Bo <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin
> > > <[email protected]>; Zhang, Yanmin <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Fri, 30 Nov 2018 06:43:17 -0800
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > Could you please send me your list of kernel boot parameters?
> > > > They usually appear near the start of your console output.
> > >
> > > Or just: cat /proc/cmdline
> > >
> > > -- Steve
> > >
> >
>



2018-12-05 08:43:47

by He, Bo

Subject: RE: rcu_preempt caused oom

I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
Our previous logs dump all the task backtraces, and the kthreads (the rcu_preempt, rcu_sched, and rcu_bh tasks) are all in "I" state, not "R" state. My understanding is that if the problem were RCU's kthreads being starved of CPU time (which running them at SCHED_FIFO priority 1 would fix), the kthreads should be in "R" state.

I will do more experiments and keep you updated once we have more findings:
1. Set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST and see whether the issue reproduces (a userspace sketch follows this list).
2. Collect more ftrace data to confirm why there is no trace_rcu_quiescent_state_report and why most of the trace_rcu_grace_period events are "AccWaitCB".
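
For experiment 1, a minimal userspace sketch of what we plan to try (assuming we take the PID of the rcu_preempt kthread from ps; this mimics from userspace the SCHED_FIFO priority-1 side effect of CONFIG_RCU_BOOST=y, and is equivalent to "chrt -f -p 1 <pid>"):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 1 };

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid-of-rcu_preempt>\n", argv[0]);
		return 1;
	}
	/* Needs root; the target PID is assumed to be an RCU kthread. */
	if (sched_setscheduler(atoi(argv[1]), SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}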

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Wednesday, December 5, 2018 3:50 AM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
Subject: Re: rcu_preempt caused oom

On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> Hi, Paul:
> The enclosed log shows the 120s hung_task_panic triggering without any other debug patches. The hung task is blocked at __wait_rcu_gp, which means the RCU CPU stall detector cannot catch this scenario:
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

Not necessarily. If there is an RCU CPU stall warning, blocking within
__wait_rcu_gp() is expected behavior. It is possible that the problem is that although the grace period is completing as required, the callbacks are not being invoked in a timely fashion. And that could happen if you had CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or, alternatively, callback offloading enabled. But I don't see these in your previous emails. Another possible cause is that the grace-period kthread is being delayed, so that the grace period never starts. This seems unlikely, but it is the only thing thus far that matches the symptoms.

CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to be run at SCHED_FIFO priority 1, and that would help in the case where RCU's grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh tasks, all of which execute in the rcu_gp_kthread() function) was being starved of CPU time.

Does that sound likely?

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Monday, December 3, 2018 9:57 PM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Zhang, Jun
> <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > Thanks. We have run the test over the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.
>
> Very good, that is encouraging. Perhaps I should think about making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for architectures for which rt_mutexes are implemented.
>
> > We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall; we will see whether we can catch the panic and will keep you posted with the test results.
> > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
>
> Looking forward to seeing what is going on! Of course, to reproduce, you will need to again build with CONFIG_RCU_BOOST=n.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Saturday, December 1, 2018 12:49 AM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > Here is the kernel cmdline:
> >
> > Thank you!
> >
> > > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
> >
> > And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> > Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster than that, then one approach would be to add something like this to the kernel command line:
> >
> > rcupdate.rcu_cpu_stall_timeout=7
> >
> > This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Steven Rostedt <[email protected]>
> > > Sent: Friday, November 30, 2018 11:17 PM
> > > To: Paul E. McKenney <[email protected]>
> > > Cc: He, Bo <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Zhang, Jun <[email protected]>; Xiao,
> > > Jin <[email protected]>; Zhang, Yanmin <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Fri, 30 Nov 2018 06:43:17 -0800 "Paul E. McKenney"
> > > <[email protected]> wrote:
> > >
> > > > Could you please send me your list of kernel boot parameters?
> > > > They usually appear near the start of your console output.
> > >
> > > Or just: cat /proc/cmdline
> > >
> > > -- Steve
> > >
> >
>



2018-12-05 17:45:45

by Paul E. McKenney

Subject: Re: rcu_preempt caused oom

On Wed, Dec 05, 2018 at 08:42:54AM +0000, He, Bo wrote:
> I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> Our previous logs dump all the task backtraces, and the kthreads (the rcu_preempt, rcu_sched, and rcu_bh tasks) are all in "I" state, not "R" state. My understanding is that if the problem were RCU's kthreads being starved of CPU time (which running them at SCHED_FIFO priority 1 would fix), the kthreads should be in "R" state.

Hmmm... Well, the tasks could in theory be waiting on a blocking mutex.
But in practice the grace-period kthreads wait on events, so that makes
no sense.

Is it possible for you to dump out the grace-period kthread's stack,
for example, with sysrq-t? (Steve might know a better way to do this.)
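
If sysrq-t is too heavy (it dumps every task on the system), a lighter-weight alternative is to read the kthread's kernel stack from procfs; a minimal sketch, assuming CONFIG_STACKTRACE=y, root privileges, and the rcu_preempt PID taken from ps:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid-of-rcu_preempt>\n", argv[0]);
		return 1;
	}
	/* /proc/<pid>/stack shows the task's current kernel stack trace. */
	snprintf(path, sizeof(path), "/proc/%s/stack", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}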

> I will do more experiments and keep you updated once we have more findings:
> 1. Set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST and see whether the issue reproduces.

That sounds like a most excellent experiment!

> 2. Collect more ftrace data to confirm why there is no trace_rcu_quiescent_state_report and why most of the trace_rcu_grace_period events are "AccWaitCB".

As noted earlier, to see something interesting, you will need to start
the ftrace before the grace period starts. This would probably mean
having ftrace running before starting the test. Starting the ftrace
after the hang commences is unlikely to produce useful information.
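
For example, a possible tracefs setup to have in place before starting
the test (assuming the usual /sys/kernel/debug/tracing mount point):

cd /sys/kernel/debug/tracing
echo 1 > events/rcu/enable       # enable all rcu:* tracepoints
echo 1 > tracing_on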

Thanx, Paul



2018-12-06 17:40:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Thu, Dec 06, 2018 at 01:23:01PM +0000, He, Bo wrote:
> 1. The test is positive after setting the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST; the issue has not reproduced so far.
> 2. Here is an earlier log with ftrace_dump enabled, from which we can get 4 seconds of ftrace. The panic was triggered with the enclosed debug patch, which replaced wait_for_completion(&rs_array[i].completion) with wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in __wait_rcu_gp(). The logs also enabled lockdep to dump the held locks, and dumped all task backtraces.
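
(For concreteness, that debug change amounts to something like the
following sketch in __wait_rcu_gp(); the panic message here is
illustrative, not the exact one from the patch:

	/* was: wait_for_completion(&rs_array[i].completion); */
	if (!wait_for_completion_timeout(&rs_array[i].completion, 3 * HZ))
		panic("__wait_rcu_gp: grace period hang");

so a stalled grace period now panics with backtraces instead of
hanging silently until the OOM.)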

Thank you for collecting this information!

(By the way, the usual downside of the priority increase is increased
context-switch rate and thus CPU overhead.)

And all three grace-period kthreads are blocked apparently in their
top-level loops (though inlining and all that). There are quite a few
preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they
are all blocking the next grace period (29041008), not the current one
(29041004). And the "rcu_unlock_preempted_task" trace records flag the
current grace-period sequence number as 29041004, which means that there
is no grace period in progress, that is, RCU is idle.
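
(For reference, the low two bits of these gp_seq numbers carry the
grace-period state; a sketch of the relevant helpers from
kernel/rcu/rcu.h:

#define RCU_SEQ_CTR_SHIFT 2
#define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)

static inline int rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}

Here 29041004 & 3 == 0, hence no grace period in flight.)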

Which explains why there is no RCU CPU stall warning -- after all, if
there is no grace period in flight, it is not possible to stall that
non-existent grace period.

That also could explain why increasing the priority of the grace-period
kthreads gets things going again. There have been a great number of
requests for a new grace period (for example, "rcu_future_grace_period:
rcu_preempt 29041004 29041008 0 0 3 Startleaf"), so as soon as the
grace-period kthread wakes up, a new grace period will start.

Except that the rcu_preempt task says "I" rather than "R", as you noted
in an earlier email.

And there should have been multiple attempts to wake up the grace-period
kthread, because there are lots of callbacks queued as in 136,045 of
them ("rcu_callback: rcu_preempt rhp=0000000066f735c9 func=file_free_rcu
2811/136045"). Which is of course why you are seeing the OOM.

So the question becomes "Why is the grace-period kthread being awakened
so many times, but not actually waking up?" In the past, there was a
scheduler bug that could cause that, but that was -way- before the v4.19
that you are running. More recently, there have been timer-related
problems, but those only happened while a grace period was active,
and were also long before v4.19.

Hmmm... One possibility is that you have somehow managed to invoke
call_rcu() with interrupts disabled, which would in turn disable the
extra wakeups that RCU sends when it sees excessive numbers of callbacks.
Except that in that case, boosting the priority wouldn't help. Besides,
the scheduling-clock interrupt should also check for this, and should
push things forward if need be.

If RCU managed to put all of its callbacks into the RCU_NEXT_READY_TAIL
bucket on all CPUs, that would defeat the wakeup-if-no-grace-period
checks (RCU is supposed to have started the relevant grace period before
putting callbacks into that bucket). But that cannot be the case here,
because new callbacks are being enqueued throughout, and these would
then trigger RCU's start-a-new-grace-period checks.
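
(The buckets mentioned here are the segments of RCU's per-CPU
segmented callback list; roughly, from include/linux/rcu_segcblist.h:

#define RCU_DONE_TAIL        0  /* Grace period done, ready to invoke.   */
#define RCU_WAIT_TAIL        1  /* Waiting on the current grace period.  */
#define RCU_NEXT_READY_TAIL  2  /* Waiting on the next grace period.     */
#define RCU_NEXT_TAIL        3  /* Not yet assigned to a grace period.   */

New callbacks land in RCU_NEXT_TAIL and advance toward RCU_DONE_TAIL as
grace periods are assigned and complete.)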

But it would be good to confirm that this is actually working like I would
expect it to. Could you please add scheduler wakeup to your tracing,
if possible, only displaying those sent to the rcu_preempt task?
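
For example, something along these lines, assuming tracefs is mounted
in the usual place:

cd /sys/kernel/debug/tracing
echo 'comm == "rcu_preempt"' > events/sched/sched_wakeup/filter
echo 1 > events/sched/sched_wakeup/enable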

Thanx, Paul




2018-12-07 14:12:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Fri, Dec 07, 2018 at 08:25:09AM +0000, He, Bo wrote:
> Bad news: the issue still reproduced after a 61-hour monkey test on 1 of 6 boards with CONFIG_RCU_BOOST=y, and the issue is not seen on kernel 4.14; CONFIG_RCU_BOOST is also not enabled in our kernel 4.14 config.
> Enclosed are the logs.
>
> > So the question becomes "Why is the grace-period kthread being awakened so many times, but not actually waking up?"
> maybe it's not a scheduling issue; I have two suspects:
> we can see tons of grace-period events for 117392312:
> [219346.919405, 0] showmap-31232 [000] d..1 219346.136035: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Startleaf
> [219346.919417, 0] showmap-31232 [000] d..1 219346.136036: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Prestarted
> [219346.919429, 0] showmap-31232 [000] d..1 219346.136036: rcu_grace_period: rcu_preempt 117392312 AccWaitCB
>
> "Startleaf" means start the grace period, "Prestarted" means the grace period is already started or other conditions blocked, RCU_GP_FLAG_INIT should follow the "Startedroot", then the kthread can be wakeup.

Yes, when "Startleaf" is followed by "Prestarted", that means that we
reached an rcu_node structure that is already aware that the requested
grace period is needed. Breaking down the relevant "if" statement in
rcu_start_this_gp():

	if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
	    // A. GP already requested at this rcu_node
	    rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
	    // B. The requested grace period already started
	    (rnp != rnp_start &&
	     rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
		// C. Leaf rcu_node recorded request, and
		//    some grace period is in progress

A: In this case, the "Startedroot" should be taken care of by some
other thread, or one of B or C held earlier.

B: This cannot be the case, because your earlier trace showed that
the requested grace period had not started.

C: This cannot be the case because both traces above are on the
leaf rcu_node structure. If it were the case, the currently
running grace period would notice the need for the requested
grace period when it ended, and would start the grace period
at that time.

So you are saying that your trace goes back far enough to capture the
very first "Startleaf" for this new grace period, and you don't ever see a
"Startedroot"? This would be OK if the later "Startedleaf" showed up at
that point. If you do have such a trace, could you please send it to me
(or post it somewhere and send me the URL)?

In any case, this code has been reworked recently, so I will take a closer
look, which will take some time. Please feel free to continue to do so
as well, of course!

> I did an experiment to dump the backtrace; rcu_quiescent_state_report is called in softirq context:
> <idle>-0 [000] dNs2 24471.669280: rcu_quiescent_state_report: rcu_preempt 3562401 1>0 0 0 3 0
> <idle>-0 [000] dNs2 24471.669293: <stack trace>
> => rcu_report_qs_rnp+0x1e2/0x2a0
> => rcu_process_callbacks+0x2f1/0x3c0
> => __do_softirq+0x12a/0x386
> => irq_exit+0xb1/0xc0
> => smp_apic_timer_interrupt+0xd4/0x1e0
> => apic_timer_interrupt+0xf/0x20
> => cpuidle_enter_state+0xb1/0x340
> => cpuidle_enter+0x17/0x20
> => call_cpuidle+0x23/0x40
> => do_idle+0x1ed/0x250
> => cpu_startup_entry+0x73/0x80
> => rest_init+0xf3/0x100
> => start_kernel+0x46f/0x490
> => x86_64_start_reservations+0x2a/0x2c
> => x86_64_start_kernel+0x72/0x75
> => secondary_startup_64+0xa4/0xb0
> rcu_report_qs_rnp=>rcu_report_qs_rdp
>
> and in rcu_report_qs_rdp(), rcu_report_qs_rnp() follows rcu_accelerate_cbs(); we can see the AccWaitCB log but not rcu_quiescent_state_report, so most likely the (rnp->qsmask & mask) condition blocked it.
>
> static void
> rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
> {
> ...
> 	if ((rnp->qsmask & mask) == 0) {
> 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 	} else {
> 		rdp->core_needs_qs = false;
> 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
>
> 		if (needwake)
> 			rcu_gp_kthread_wake(rsp);
> 	}
> }

This is a completely different code path. The rcu_start_this_gp()
function is trying to start a new grace period. In contrast, this
rcu_report_qs_rdp() function reports a quiescent state for a currently
running grace period. In your earlier trace, there was no currently
running grace period, so rcu_report_qs_rdp() exiting early is expected
behavior.

Thanx, Paul




2018-12-09 19:56:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Fri, Dec 07, 2018 at 06:11:31AM -0800, Paul E. McKenney wrote:
> In any case, this code has been reworked recently, so I will take a closer
> look, which will take some time. Please feel free to continue to do so
> as well, of course!

Hmmm... Could you please build with CONFIG_PROVE_RCU=y and rerun the
original configuration (for example, CONFIG_RCU_BOOST=n)? I would expect this to
trigger the warning in rcu_check_gp_start_stall(). Of course, if it
does not trigger, that would be valuable information as well.
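
In a v4.19 .config that would look something like the following (note
that CONFIG_PROVE_RCU is selected by CONFIG_PROVE_LOCKING):

CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
# CONFIG_RCU_BOOST is not set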

Thanx, Paul



2018-12-10 07:10:20

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

Hi,
We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s check to detect the rcu_preempt hang; hopefully we can get more useful logs tomorrow.
I have also enclosed the config and the debug patches for your review.

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Monday, December 10, 2018 3:56 AM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
Subject: Re: rcu_preempt caused oom

On Fri, Dec 07, 2018 at 06:11:31AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 07, 2018 at 08:25:09AM +0000, He, Bo wrote:
> > Bad news, the issue is still reproduced after 61 Hours monkey test on 1/6 boards with the CONFIG_RCU_BOOST=y, and the issue is not seen on kernel 4.14, the CONFIG_RCU_BOOST is also not enabled in our kernel 4.14 config.
> > Here enclosed is the logs.
> >
> > > So the question becomes "Why is the grace-period kthread being awakened so many times, but not actually waking up?"
> > maybe it's not schedule issue, I have two suspects:
> > we can see tons of grace_period with 117392312:
> > [219346.919405, 0] showmap-31232 [000] d..1 219346.136035:
> > rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3
> > Startleaf [219346.919417, 0] showmap-31232 [000] d..1
> > 219346.136036: rcu_future_grace_period: rcu_preempt 117392312
> > 117392316 0 0 3 Prestarted [219346.919429, 0] showmap-31232 [000]
> > d..1 219346.136036: rcu_grace_period: rcu_preempt 117392312
> > AccWaitCB
> >
> > "Startleaf" means start the grace period, "Prestarted" means the grace period is already started or other conditions blocked, RCU_GP_FLAG_INIT should follow the "Startedroot", then the kthread can be wakeup.
>
> Yes, when "Startleaf" is followed by "Prestarted", that means that we
> reached an rcu_node structure that is already aware that the requested
> grace period is needed. Breaking down the relevant "if" statement in
> rcu_start_this_gp():
>
> if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
> // A. GP already requested at this rcu_node
> rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
> // B. The requested grace period already started
> (rnp != rnp_start &&
> rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
> // C. Leaf rcu_node recorded request, and
> // some grace period is in progress
>
> A: In this case, the "Startedroot" should be taken care of by some
> other thread, or one of B or C held earlier.
>
> B: This cannot be the case, because your earlier trace showed that
> the requested grace period had not started.
>
> C: This cannot be the case because both traces above are on the
> leaf rcu_node structure. If it were the case, the currently
> running grace period would notice the need for the requested
> grace period when it ended, and would start the grace period
> at that time.
>
> So you are saying that your trace goes back far enough to capture the
> very first "Startleaf" for this new grace period, and you don't ever
> see a "Startedroot"? This would be OK if the later "Startedleaf"
> showed up at that point. If you do have such a trace, could you
> please send it to me (or post it somewhere and send me the URL)?
>
> In any case, this code has bee reworked recently, so I will take a
> closer look, which will take some time. Please feel free to continue
> to do so as well, of course!

Hmmm... Could you please build with CONFIG_PROVE_RCU=y and run the original (for example, CONFIG_RCU_BOOST=n)? I would expect this to trigger the warning in rcu_check_gp_start_stall(). Of course, if it does not trigger, that would be valuable information as well.

Thanx, Paul

> > I did an experiment to dump the backtrace; rcu_quiescent_state_report is called in softirq context:
> > <idle>-0 [000] dNs2 24471.669280: rcu_quiescent_state_report: rcu_preempt 3562401 1>0 0 0 3 0
> > <idle>-0 [000] dNs2 24471.669293: <stack trace>
> > => rcu_report_qs_rnp+0x1e2/0x2a0
> > => rcu_process_callbacks+0x2f1/0x3c0
> > => __do_softirq+0x12a/0x386
> > => irq_exit+0xb1/0xc0
> > => smp_apic_timer_interrupt+0xd4/0x1e0
> > => apic_timer_interrupt+0xf/0x20
> > => cpuidle_enter_state+0xb1/0x340
> > => cpuidle_enter+0x17/0x20
> > => call_cpuidle+0x23/0x40
> > => do_idle+0x1ed/0x250
> > => cpu_startup_entry+0x73/0x80
> > => rest_init+0xf3/0x100
> > => start_kernel+0x46f/0x490
> > => x86_64_start_reservations+0x2a/0x2c
> > => x86_64_start_kernel+0x72/0x75
> > => secondary_startup_64+0xa4/0xb0
> > (rcu_report_qs_rnp here is reached via rcu_report_qs_rdp)
> >
> > In rcu_report_qs_rdp(), the call to rcu_report_qs_rnp() follows rcu_accelerate_cbs(), and we can see the AccWaitCB log but no rcu_quiescent_state_report, so most likely the (rnp->qsmask & mask) condition is what blocked it.
> >
> > static void
> > rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
> > {
> >         ...
> >         if ((rnp->qsmask & mask) == 0) {
> >                 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >         } else {
> >                 rdp->core_needs_qs = false;
> >                 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> >                 rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> >
> >                 if (needwake)
> >                         rcu_gp_kthread_wake(rsp);
> >         }
> > }
>
> This is a completely different code path. The rcu_start_this_gp()
> function is trying to start a new grace period. In contrast, this
> rcu_report_qs_rdp() function reports a quiescent state for a currently
> running grace period. In your earlier trace, there was no currently
> running grace period, so rcu_report_qs_rdp() exiting early is expected
> behavior.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Friday, December 7, 2018 1:38 AM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>; Bai, Jie A <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 06, 2018 at 01:23:01PM +0000, He, Bo wrote:
> > > 1. The test is positive after setting the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST; the issue has not reproduced so far.
> > > 2. Here is the previous log with ftrace_dump enabled, giving us 4 seconds of ftrace. The panic was triggered with the enclosed debug patch, which replaces wait_for_completion(&rs_array[i].completion) with wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in __wait_rcu_gp(). The logs have lockdep enabled to dump the held locks, and dump all task backtraces.
> >
> > Thank you for collecting this information!
> >
> > (By the way, the usual downside of the priority increase is
> > increased context-switch rate and thus CPU overhead.)
> >
> > And all three grace-period kthreads are blocked apparently in their top-level loops (though inlining and all that). There are quite a few preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they are all blocking the next grace period (29041008), not the current one (29041004). And the "rcu_unlock_preempted_task" trace records flag the current grace-period sequence number as 29041004, which means that there is no grace period in progress, that is, RCU is idle.
> >
> > Which explains why there is no RCU CPU stall warning -- after all, if there is no grace period in flight, it is not possible to stall that non-existent grace period.
> >
> > That also could explain why increasing the priority of the grace-period kthreads gets things going again. There have been a great number of requests for a new grace period (for example, "rcu_future_grace_period:
> > rcu_preempt 29041004 29041008 0 0 3 Startleaf"), so as soon as the grace-period kthread wakes up, a new grace period will start.
> >
> > Except that the rcu_preempt task says "I" rather than "R", as you noted in an earlier email.
> >
> > And there should have been multiple attempts to wake up the grace-period kthread, because there are lots of callbacks queued as in 136,045 of them ("rcu_callback: rcu_preempt rhp=0000000066f735c9 func=file_free_rcu 2811/136045"). Which is of course why you are seeing the OOM.
> >
> > So the question becomes "Why is the grace-period kthread being awakened so many times, but not actually waking up?" In the past, there was a scheduler bug that could cause that, but that was -way- before the v4.19 that you are running. More recently, there have been timer-related problems, but those only happened while a grace period was active, and were also long before v4.19.
> >
> > Hmmm... One possibility is that you have somehow managed to invoke
> > call_rcu() with interrupts disabled, which would in turn disable the extra wakeups that RCU sends when it sees excessive numbers of callbacks.
> > Except that in that case, boosting the priority wouldn't help. Besides, the scheduling-clock interrupt should also check for this, and should push things forward if need be.
> >
> > If RCU managed to put all of its callbacks into the RCU_NEXT_READY_TAIL bucket on all CPUs, that would defeat the wakeup-if-no-grace-period checks (RCU is supposed to have started the relevant grace period before putting callbacks into that bucket). But that cannot be the case here, because new callbacks are being enqueued throughout, and these would then trigger RCU's start-a-new-grace-period checks.
> >
> > But it would be good to confirm that this is actually working like I would expect it to. Could you please add scheduler wakeup to your tracing, if possible, only displaying those sent to the rcu_preempt task?
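
For reference, one way to capture exactly those wakeups with stock ftrace; a minimal sketch, assuming the usual tracefs mount at /sys/kernel/debug/tracing:

cd /sys/kernel/debug/tracing
echo 'comm == "rcu_preempt"' > events/sched/sched_wakeup/filter
echo 1 > events/sched/sched_wakeup/enable
echo 1 > tracing_on

This logs each sched_wakeup event whose target task is rcu_preempt alongside the rcu:* events already being traced.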
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Thursday, December 6, 2018 1:45 AM
> > > To: He, Bo <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Zhang, Jun
> > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang,
> > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 05, 2018 at 08:42:54AM +0000, He, Bo wrote:
> > > > I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> > > > Our previous logs dump all the task backtraces, and the kthreads (the rcu_preempt, rcu_sched, and rcu_bh tasks) are all in "I" state, not "R" state. My understanding is that if the fix were a side effect of running RCU's kthreads at SCHED_FIFO priority 1, the kthreads should be in "R" state.
> > >
> > > Hmmm... Well, the tasks could in theory be waiting on a blocking mutex.
> > > But in practice the grace-period kthreads wait on events, so that makes no sense.
> > >
> > > Is it possible for you to dump out the grace-period kthread's
> > > stack, for example, with sysreq-t? (Steve might know a better way
> > > to do
> > > this.)
> > >
> > > > I will do more experiments and keep you updated once we have more findings:
> > > > 1. Set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST and see if the issue reproduces (see the sketch below).
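
A minimal sketch of that first experiment, assuming the priority is set from userspace after boot rather than via the rcutree.kthread_prio= boot parameter:

chrt -f -p 1 $(pgrep -x rcu_preempt)   # SCHED_FIFO, priority 1
chrt -f -p 1 $(pgrep -x rcu_sched)
chrt -f -p 1 $(pgrep -x rcu_bh)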
> > >
> > > That sounds like a most excellent experiment!
> > >
> > > > 2. Check more ftrace to confirm why there is no trace_rcu_quiescent_state_report and why most of the trace_rcu_grace_period events are "AccWaitCB".
> > >
> > > As noted earlier, to see something interesting, you will need to start the ftrace before the grace period starts. This would probably mean having ftrace running before starting the test. Starting the ftrace after the hang commences is unlikely to produce useful information.
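
For reference, the rcu trace events can be made active from boot, well before any hang develops, via the standard trace_event= kernel parameter; a sketch using the events already seen in this thread:

trace_event=rcu:rcu_grace_period,rcu:rcu_future_grace_period,rcu:rcu_quiescent_state_report trace_buf_size=1024K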
> > >
> > > Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Wednesday, December 5, 2018 3:50 AM
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang,
> > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> > > > > Hi, Paul:
> > > > > enclosed is the log that triggered the 120s hung_task panic without other debug patches. The hung task is blocked at __wait_rcu_gp, which means RCU CPU stall detection does not catch this scenario:
> > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > > echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> > > >
> > > > Not necessarily. If there is an RCU CPU stall warning, blocking
> > > > within
> > > > __wait_rcu_gp() is expected behavior. It is possible that the problem is that although the grace period is completing as required, the callbacks are not being invoked in a timely fashion. And that could happen if you had CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or, alternatively, callback offloading enabled. But I don't see these in your previous emails. Another possible cause is that the grace-period kthread is being delayed, so that the grace period never starts. This seems unlikely, but it is the only thing thus far that matches the symptoms.
> > > >
> > > > CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to be run at SCHED_FIFO priority 1, and that would help in the case where RCU's grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh tasks, all of which execute in the rcu_gp_kthread() function) was being starved of CPU time.
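
For reference, whether the grace-period kthreads actually ended up at SCHED_FIFO priority 1 can be verified from userspace; a minimal sketch (scheduling class "FF" denotes SCHED_FIFO):

ps -eo pid,class,rtprio,comm | grep -E 'rcu_(preempt|sched|bh)'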
> > > >
> > > > Does that sound likely?
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <[email protected]>
> > > > > Sent: Monday, December 3, 2018 9:57 PM
> > > > > To: He, Bo <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Zhang,
> > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > Zhang, Yanmin <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > > > > > Thanks, we have run the test for the whole weekend without reproducing the issue, so we confirm that CONFIG_RCU_BOOST fixes it.
> > > > >
> > > > > Very good, that is encouraging. Perhaps I should think about making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for architectures for which rt_mutexes are implemented.
> > > > >
> > > > > > We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set panic on RCU stall, and will see if we can catch the panic; we will keep you posted with the test results.
> > > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > >
> > > > > Looking forward to seeing what is going on! Of course, to reproduce, you will need to again build with CONFIG_RCU_BOOST=n.
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney <[email protected]>
> > > > > > Sent: Saturday, December 1, 2018 12:49 AM
> > > > > > To: He, Bo <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]; [email protected];
> > > > > > Zhang, Jun <[email protected]>; Xiao, Jin
> > > > > > <[email protected]>; Zhang, Yanmin <[email protected]>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > > > > > Here is the kernel cmdline:
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > > Kernel command line: androidboot.acpio_idx=0
> > > > > > > androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug
> > > > > > > androidboot.diskbus=00.0 androidboot.verifiedbootstate=green
> > > > > > > androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb
> > > > > > > g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves
> > > > > > > reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti
> > > > > > > ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt
> > > > > > > loglevel=4 androidboot.hardware=gordon_peak
> > > > > > > firmware_class.path=/vendor/firmware relative_sleep_states=1
> > > > > > > enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10
> > > > > > > androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/
> > > > > > > pstore.backend=ramoops memmap=0x1400000$0x50000000
> > > > > > > ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000
> > > > > > > ramoops.record_size=0x4000 ramoops.console_size=0x1000000
> > > > > > > ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > > > > > > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1
> > > > > > > drm.vblankoffdelay=
> > > > > >
> > > > > > And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> > > > > > Does it take more than 21 seconds to OOM? Or do things happen faster than that? If they do happen faster, then one approach would be to add something like this to the kernel command line:
> > > > > >
> > > > > > rcupdate.rcu_cpu_stall_timeout=7
> > > > > >
> > > > > > This would set the stall timeout to seven seconds. Note that timeouts less than three seconds are silently interpreted as three seconds.
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Rostedt <[email protected]>
> > > > > > > Sent: Friday, November 30, 2018 11:17 PM
> > > > > > > To: Paul E. McKenney <[email protected]>
> > > > > > > Cc: He, Bo <[email protected]>;
> > > > > > > [email protected]; [email protected];
> > > > > > > [email protected]; [email protected];
> > > > > > > Zhang, Jun <[email protected]>; Xiao, Jin
> > > > > > > <[email protected]>; Zhang, Yanmin
> > > > > > > <[email protected]>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Fri, 30 Nov 2018 06:43:17 -0800 "Paul E. McKenney"
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > Could you please send me your list of kernel boot parameters?
> > > > > > > > They usually appear near the start of your console output.
> > > > > > >
> > > > > > > Or just: cat /proc/cmdline
> > > > > > >
> > > > > > > -- Steve
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
>


Attachments:
0001-add-rcu-hung-task-detect.patch (1.04 kB)
0002-rcu-detect-the-preempt_rcu-hang.patch (3.66 kB)
config.gz (36.58 kB)

2018-12-11 01:08:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> Hi,
> We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> I also enclosed the config and the debug patches for your review.

I instead suggest the (lightly tested) debug patch shown below, which
tracks wakeups of RCU's grace-period kthreads and dumps them out if a
given requested grace period fails to start. Again, it is necessary to
build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.

Thanx, Paul

------------------------------------------------------------------------

commit 2a3826f15adaf92d046c80e38d090ecff5403807
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..7daaef57d905 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -1774,8 +1791,8 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_state *rsp,
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
- if (needwake)
- rcu_gp_kthread_wake(rsp);
+ /* if (needwake)
+ rcu_gp_kthread_wake(rsp); */
}

/*
@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



2018-12-11 06:27:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Mon, Dec 10, 2018 at 04:38:38PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> > Hi,
> > We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> > I also enclosed the config and the debug patches for your review.
>
> I instead suggest the (lightly tested) debug patch shown below, which
> tracks wakeups of RCU's grace-period kthreads and dumps them out if a
> given requested grace period fails to start. Again, it is necessary to
> build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.

Right. This time without commenting out the wakeup as a test of the
diagnostic. :-/

Please use the patch below instead of the one that I sent in my
previous email.

Thanx, Paul

------------------------------------------------------------------------

commit adfc7dff659495a3433d5084256be59eee0ac6df
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..4bcd8753e293 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



2018-12-11 06:30:15

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

Sure, we will update to the new patch and run the test.

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Tuesday, December 11, 2018 12:47 PM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
Subject: Re: rcu_preempt caused oom

On Mon, Dec 10, 2018 at 04:38:38PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> > Hi,
> > We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> > I also enclosed the config and the debug patches for your review.
>
> I instead suggest the (lightly tested) debug patch shown below, which
> tracks wakeups of RCU's grace-period kthreads and dumps them out if a
> given requested grace period fails to start. Again, it is necessary
> to build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.

Right. This time without commenting out the wakeup as a test of the diagnostic. :-/

Please use the patch below instead of the one that I sent in my previous email.

Thanx, Paul

------------------------------------------------------------------------

commit adfc7dff659495a3433d5084256be59eee0ac6df
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..4bcd8753e293 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



2018-12-12 01:39:31

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

We reproduced the hung_task panic with the patch "Improve diagnostics for failed RCU grace-period start", but unfortunately show_rcu_gp_kthreads() did not print any logs, perhaps due to the loglevel; we will improve the build and rerun the test to double-check.
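
For reference, show_rcu_gp_kthreads() prints at KERN_INFO via pr_info(), which the console suppresses at loglevel=4 (the value on this kernel command line), although the messages should still land in the dmesg buffer; a minimal sketch of raising the console loglevel before the test:

echo 8 > /proc/sys/kernel/printk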

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Tuesday, December 11, 2018 12:47 PM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
Subject: Re: rcu_preempt caused oom

On Mon, Dec 10, 2018 at 04:38:38PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> > Hi,
> > We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> > I also enclosed the config and the debug patches for your review.
>
> I instead suggest the (lightly tested) debug patch shown below, which
> tracks wakeups of RCU's grace-period kthreads and dumps them out if a
> given requested grace period fails to start. Again, it is necessary
> to build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.

Right. This time without commenting out the wakeup as a test of the diagnostic. :-/

Please use the patch below instead of the one that I sent in my previous email.

Thanx, Paul

------------------------------------------------------------------------

commit adfc7dff659495a3433d5084256be59eee0ac6df
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..4bcd8753e293 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



Attachments:
console-ramoops_20111111154601.txt.gz (156.82 kB)

2018-12-12 02:26:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 01:37:40AM +0000, He, Bo wrote:
> We reproduced the hung_task panic with the patch "Improve diagnostics for failed RCU grace-period start", but unfortunately show_rcu_gp_kthreads() did not print any logs, perhaps due to the loglevel; we will improve the build and rerun the test to double-check.

Well, at least the diagnostics didn't prevent the problem from happening. ;-)

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Tuesday, December 11, 2018 12:47 PM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Mon, Dec 10, 2018 at 04:38:38PM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> > > Hi,
> > > We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> > > I also enclosed the config and the debug patches for your review.
> >
> > I instead suggest the (lightly tested) debug patch shown below, which
> > tracks wakeups of RCU's grace-period kthreads and dumps them out if a
> > given requested grace period fails to start. Again, it is necessary
> > to build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.
>
> Right. This time without commenting out the wakeup as a test of the diagnostic. :-/
>
> Please use the patch below instead of the one that I sent in my previous email.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit adfc7dff659495a3433d5084256be59eee0ac6df
> Author: Paul E. McKenney <[email protected]>
> Date: Mon Dec 10 16:33:59 2018 -0800
>
> rcu: Improve diagnostics for failed RCU grace-period start
>
> Backported from v4.21/v5.0
>
> If a grace period fails to start (for example, because you commented
> out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
> will invoke rcu_check_gp_start_stall(), which will notice and complain.
> However, this complaint is lacking crucial debugging information such
> as when the last wakeup executed and what the value of ->gp_seq was at
> that time. This commit therefore removes the current pr_alert() from
> rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
> which has been updated to print the needed information, which is collected
> by rcu_gp_kthread_wake().
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0b760c1369f7..4bcd8753e293 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
> }
> EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);
>
> +/*
> + * Convert a ->gp_state value to a character string.
> + */
> +static const char *gp_state_getname(short gs)
> +{
> + if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
> + return "???";
> + return gp_state_names[gs];
> +}
> +
> +/*
> + * Return the root node of the specified rcu_state structure.
> + */
> +static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> +{
> + return &rsp->node[0];
> +}
> +
> /*
> * Show the state of the grace-period kthreads.
> */
> void show_rcu_gp_kthreads(void)
> {
> int cpu;
> + unsigned long j;
> + unsigned long ja;
> + unsigned long jr;
> + unsigned long jw;
> struct rcu_data *rdp;
> struct rcu_node *rnp;
> struct rcu_state *rsp;
>
> + j = jiffies;
> for_each_rcu_flavor(rsp) {
> - pr_info("%s: wait state: %d ->state: %#lx\n",
> - rsp->name, rsp->gp_state, rsp->gp_kthread->state);
> + ja = j - READ_ONCE(rsp->gp_activity);
> + jr = j - READ_ONCE(rsp->gp_req_activity);
> + jw = j - READ_ONCE(rsp->gp_wake_time);
> + pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
> + rsp->name, gp_state_getname(rsp->gp_state),
> + rsp->gp_state,
> + rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
> + ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
> + (long)READ_ONCE(rsp->gp_seq),
> + (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
> + READ_ONCE(rsp->gp_flags));
> rcu_for_each_node_breadth_first(rsp, rnp) {
> if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
> continue;
> - pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
> - rnp->grplo, rnp->grphi, rnp->gp_seq,
> - rnp->gp_seq_needed);
> + pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
> + rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
> + (long)rnp->gp_seq_needed);
> if (!rcu_is_leaf_node(rnp))
> continue;
> for_each_leaf_node_possible_cpu(rnp, cpu) {
> @@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
> ULONG_CMP_GE(rsp->gp_seq,
> rdp->gp_seq_needed))
> continue;
> - pr_info("\tcpu %d ->gp_seq_needed %lu\n",
> - cpu, rdp->gp_seq_needed);
> + pr_info("\tcpu %d ->gp_seq_needed %ld\n",
> + cpu, (long)rdp->gp_seq_needed);
> }
> }
> /* sched_show_task(rsp->gp_kthread); */
> @@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
> }
> EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);
>
> -/*
> - * Return the root node of the specified rcu_state structure.
> - */
> -static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> -{
> - return &rsp->node[0];
> -}
> -
> /*
> * Enter an RCU extended quiescent state, which can be either the
> * idle loop or adaptive-tickless usermode execution.
> @@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
> rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
> }
>
> -/*
> - * Convert a ->gp_state value to a character string.
> - */
> -static const char *gp_state_getname(short gs)
> -{
> - if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
> - return "???";
> - return gp_state_names[gs];
> -}
> -
> /*
> * Complain about starvation of grace-period kthread.
> */
> @@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> * Don't do a self-awaken, and don't bother awakening when there is
> * nothing for the grace-period kthread to do (as in several CPUs
> * raced to awaken, and we lost), and finally don't try to awaken
> - * a kthread that has not yet been created.
> + * a kthread that has not yet been created. If all those checks are
> + * passed, track some debug information and awaken.
> */
> static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> {
> @@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> !READ_ONCE(rsp->gp_flags) ||
> !rsp->gp_kthread)
> return;
> + WRITE_ONCE(rsp->gp_wake_time, jiffies);
> + WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
> swake_up_one(&rsp->gp_wq);
> }
>
> @@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> return;
> }
> - pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
> - __func__, (long)READ_ONCE(rsp->gp_seq),
> - (long)READ_ONCE(rnp_root->gp_seq_needed),
> - j - rsp->gp_req_activity, j - rsp->gp_activity,
> - rsp->gp_flags, rsp->gp_state, rsp->name,
> - rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
> WARN_ON(1);
> if (rnp_root != rnp)
> raw_spin_unlock_rcu_node(rnp_root);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> + show_rcu_gp_kthreads();
> }
>
> /*
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..0e051d9b5f1a 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -327,6 +327,8 @@ struct rcu_state {
> struct swait_queue_head gp_wq; /* Where GP task waits. */
> short gp_flags; /* Commands for GP task. */
> short gp_state; /* GP kthread sleep state. */
> + unsigned long gp_wake_time; /* Last GP kthread wake. */
> + unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
>
> /* End of fields guarded by root rcu_node's lock. */
>
>



2018-12-12 15:44:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> We reproduced on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch cannot catch the scenario.
> I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.

Strange.

Are the systems responsive to sysrq keys once failure occurs? If so, I will
provide you a sysrq-R or some such to dump out the RCU state.
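
For reference, if a shell is still alive after the failure, sysrq can also be exercised through procfs; a minimal sketch using the task-dump key:

echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions
echo t > /proc/sysrq-trigger         # dump all task states to the kernel log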

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Wednesday, December 12, 2018 10:25 AM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Wed, Dec 12, 2018 at 01:37:40AM +0000, He, Bo wrote:
> > We reproduced the hung_task panic with the patch "Improve diagnostics for failed RCU grace-period start", but unfortunately show_rcu_gp_kthreads() did not print any logs, perhaps due to the loglevel; we will improve the build and rerun the test to double-check.
>
> Well, at least the diagnostics didn't prevent the problem from happening. ;-)
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Tuesday, December 11, 2018 12:47 PM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>; Bai, Jie A <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Mon, Dec 10, 2018 at 04:38:38PM -0800, Paul E. McKenney wrote:
> > > On Mon, Dec 10, 2018 at 06:56:18AM +0000, He, Bo wrote:
> > > > Hi,
> > > > We have started the test with CONFIG_PROVE_RCU=y, and also added a 2s timeout to detect the preempt RCU hang; hopefully we can get more useful logs tomorrow.
> > > > I also enclosed the config and the debug patches for your review.
> > >
> > > I instead suggest the (lightly tested) debug patch shown below,
> > > which tracks wakeups of RCU's grace-period kthreads and dumps them
> > > out if a given requested grace period fails to start. Again, it is
> > > necessary to build with CONFIG_PROVE_RCU=y, that is, with CONFIG_PROVE_LOCKING=y.
> >
> > Right. This time without commenting out the wakeup as a test of the
> > diagnostic. :-/
> >
> > Please use the patch below instead of the one that I sent in my previous email.
> >
> > Thanx, Paul
> >
> > ----------------------------------------------------------------------
> > --
> >
> > commit adfc7dff659495a3433d5084256be59eee0ac6df
> > Author: Paul E. McKenney <[email protected]>
> > Date: Mon Dec 10 16:33:59 2018 -0800
> >
> > rcu: Improve diagnostics for failed RCU grace-period start
> >
> > Backported from v4.21/v5.0
> >
> > If a grace period fails to start (for example, because you commented
> > out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
> > will invoke rcu_check_gp_start_stall(), which will notice and complain.
> > However, this complaint is lacking crucial debugging information such
> > as when the last wakeup executed and what the value of ->gp_seq was at
> > that time. This commit therefore removes the current pr_alert() from
> > rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
> > which has been updated to print the needed information, which is collected
> > by rcu_gp_kthread_wake().
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 0b760c1369f7..4bcd8753e293 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
> > }
> > EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);
> >
> > +/*
> > + * Convert a ->gp_state value to a character string.
> > + */
> > +static const char *gp_state_getname(short gs)
> > +{
> > + if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
> > + return "???";
> > + return gp_state_names[gs];
> > +}
> > +
> > +/*
> > + * Return the root node of the specified rcu_state structure.
> > + */
> > +static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> > +{
> > + return &rsp->node[0];
> > +}
> > +
> > /*
> > * Show the state of the grace-period kthreads.
> > */
> > void show_rcu_gp_kthreads(void)
> > {
> > int cpu;
> > + unsigned long j;
> > + unsigned long ja;
> > + unsigned long jr;
> > + unsigned long jw;
> > struct rcu_data *rdp;
> > struct rcu_node *rnp;
> > struct rcu_state *rsp;
> >
> > + j = jiffies;
> > for_each_rcu_flavor(rsp) {
> > - pr_info("%s: wait state: %d ->state: %#lx\n",
> > - rsp->name, rsp->gp_state, rsp->gp_kthread->state);
> > + ja = j - READ_ONCE(rsp->gp_activity);
> > + jr = j - READ_ONCE(rsp->gp_req_activity);
> > + jw = j - READ_ONCE(rsp->gp_wake_time);
> > + pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
> > + rsp->name, gp_state_getname(rsp->gp_state),
> > + rsp->gp_state,
> > + rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
> > + ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
> > + (long)READ_ONCE(rsp->gp_seq),
> > + (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
> > + READ_ONCE(rsp->gp_flags));
> > rcu_for_each_node_breadth_first(rsp, rnp) {
> > if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
> > continue;
> > - pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
> > - rnp->grplo, rnp->grphi, rnp->gp_seq,
> > - rnp->gp_seq_needed);
> > + pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
> > + rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
> > + (long)rnp->gp_seq_needed);
> > if (!rcu_is_leaf_node(rnp))
> > continue;
> > for_each_leaf_node_possible_cpu(rnp, cpu) {
> > @@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
> > ULONG_CMP_GE(rsp->gp_seq,
> > rdp->gp_seq_needed))
> > continue;
> > - pr_info("\tcpu %d ->gp_seq_needed %lu\n",
> > - cpu, rdp->gp_seq_needed);
> > + pr_info("\tcpu %d ->gp_seq_needed %ld\n",
> > + cpu, (long)rdp->gp_seq_needed);
> > }
> > }
> > /* sched_show_task(rsp->gp_kthread); */
> > @@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
> > }
> > EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);
> >
> > -/*
> > - * Return the root node of the specified rcu_state structure.
> > - */
> > -static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> > -{
> > - return &rsp->node[0];
> > -}
> > -
> > /*
> > * Enter an RCU extended quiescent state, which can be either the
> > * idle loop or adaptive-tickless usermode execution.
> > @@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
> > rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
> > }
> >
> > -/*
> > - * Convert a ->gp_state value to a character string.
> > - */
> > -static const char *gp_state_getname(short gs)
> > -{
> > - if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
> > - return "???";
> > - return gp_state_names[gs];
> > -}
> > -
> > /*
> > * Complain about starvation of grace-period kthread.
> > */
> > @@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > * Don't do a self-awaken, and don't bother awakening when there is
> > * nothing for the grace-period kthread to do (as in several CPUs
> > * raced to awaken, and we lost), and finally don't try to awaken
> > - * a kthread that has not yet been created.
> > + * a kthread that has not yet been created. If all those checks are
> > + * passed, track some debug information and awaken.
> > */
> > static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> > {
> > @@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> > !READ_ONCE(rsp->gp_flags) ||
> > !rsp->gp_kthread)
> > return;
> > + WRITE_ONCE(rsp->gp_wake_time, jiffies);
> > + WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
> > swake_up_one(&rsp->gp_wq);
> > }
> >
> > @@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > return;
> > }
> > - pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
> > - __func__, (long)READ_ONCE(rsp->gp_seq),
> > - (long)READ_ONCE(rnp_root->gp_seq_needed),
> > - j - rsp->gp_req_activity, j - rsp->gp_activity,
> > - rsp->gp_flags, rsp->gp_state, rsp->name,
> > - rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
> > WARN_ON(1);
> > if (rnp_root != rnp)
> > raw_spin_unlock_rcu_node(rnp_root);
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > + show_rcu_gp_kthreads();
> > }
> >
> > /*
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 4e74df768c57..0e051d9b5f1a 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -327,6 +327,8 @@ struct rcu_state {
> > struct swait_queue_head gp_wq; /* Where GP task waits. */
> > short gp_flags; /* Commands for GP task. */
> > short gp_state; /* GP kthread sleep state. */
> > + unsigned long gp_wake_time; /* Last GP kthread wake. */
> > + unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
> >
> > /* End of fields guarded by root rcu_node's lock. */
> >
> >
>
>





2018-12-12 21:07:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
>
> Strange.
>
> Are the systems responsive to sysrq keys once failure occurs? If so, I will
> provide you a sysrq-R or some such to dump out the RCU state.

Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using
the patch below. Only lightly tested.

Thanx, Paul

------------------------------------------------------------------------

commit adfc7dff659495a3433d5084256be59eee0ac6df
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..4bcd8753e293 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



2018-12-12 23:14:45

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and don't see sysrq_rcu there either.
Please correct me if I have gotten something wrong.

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Thursday, December 13, 2018 5:03 AM
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
>
> Strange.
>
> Are the systems responsive to sysrq keys once failure occurs? If so,
> I will provide you a sysrq-R or some such to dump out the RCU state.

Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.

Thanx, Paul

------------------------------------------------------------------------

commit adfc7dff659495a3433d5084256be59eee0ac6df
Author: Paul E. McKenney <[email protected]>
Date: Mon Dec 10 16:33:59 2018 -0800

rcu: Improve diagnostics for failed RCU grace-period start

Backported from v4.21/v5.0

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..4bcd8753e293 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -626,25 +626,57 @@ void rcu_sched_force_quiescent_state(void)
}
EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);

+/*
+ * Convert a ->gp_state value to a character string.
+ */
+static const char *gp_state_getname(short gs)
+{
+ if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
+ return "???";
+ return gp_state_names[gs];
+}
+
+/*
+ * Return the root node of the specified rcu_state structure.
+ */
+static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+{
+ return &rsp->node[0];
+}
+
/*
* Show the state of the grace-period kthreads.
*/
void show_rcu_gp_kthreads(void)
{
int cpu;
+ unsigned long j;
+ unsigned long ja;
+ unsigned long jr;
+ unsigned long jw;
struct rcu_data *rdp;
struct rcu_node *rnp;
struct rcu_state *rsp;

+ j = jiffies;
for_each_rcu_flavor(rsp) {
- pr_info("%s: wait state: %d ->state: %#lx\n",
- rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ ja = j - READ_ONCE(rsp->gp_activity);
+ jr = j - READ_ONCE(rsp->gp_req_activity);
+ jw = j - READ_ONCE(rsp->gp_wake_time);
+ pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n",
+ rsp->name, gp_state_getname(rsp->gp_state),
+ rsp->gp_state,
+ rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL,
+ ja, jr, jw, (long)READ_ONCE(rsp->gp_wake_seq),
+ (long)READ_ONCE(rsp->gp_seq),
+ (long)READ_ONCE(rcu_get_root(rsp)->gp_seq_needed),
+ READ_ONCE(rsp->gp_flags));
rcu_for_each_node_breadth_first(rsp, rnp) {
if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
continue;
- pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
- rnp->grplo, rnp->grphi, rnp->gp_seq,
- rnp->gp_seq_needed);
+ pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n",
+ rnp->grplo, rnp->grphi, (long)rnp->gp_seq,
+ (long)rnp->gp_seq_needed);
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -653,8 +685,8 @@ void show_rcu_gp_kthreads(void)
ULONG_CMP_GE(rsp->gp_seq,
rdp->gp_seq_needed))
continue;
- pr_info("\tcpu %d ->gp_seq_needed %lu\n",
- cpu, rdp->gp_seq_needed);
+ pr_info("\tcpu %d ->gp_seq_needed %ld\n",
+ cpu, (long)rdp->gp_seq_needed);
}
}
/* sched_show_task(rsp->gp_kthread); */
@@ -690,14 +722,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
}
EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);

-/*
- * Return the root node of the specified rcu_state structure.
- */
-static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-{
- return &rsp->node[0];
-}
-
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
@@ -1285,16 +1309,6 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
}

-/*
- * Convert a ->gp_state value to a character string.
- */
-static const char *gp_state_getname(short gs)
-{
- if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names))
- return "???";
- return gp_state_names[gs];
-}
-
/*
* Complain about starvation of grace-period kthread.
*/
@@ -1693,7 +1707,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
* Don't do a self-awaken, and don't bother awakening when there is
* nothing for the grace-period kthread to do (as in several CPUs
* raced to awaken, and we lost), and finally don't try to awaken
- * a kthread that has not yet been created.
+ * a kthread that has not yet been created. If all those checks are
+ * passed, track some debug information and awaken.
*/
static void rcu_gp_kthread_wake(struct rcu_state *rsp)
{
@@ -1701,6 +1716,8 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
!READ_ONCE(rsp->gp_flags) ||
!rsp->gp_kthread)
return;
+ WRITE_ONCE(rsp->gp_wake_time, jiffies);
+ WRITE_ONCE(rsp->gp_wake_seq, READ_ONCE(rsp->gp_seq));
swake_up_one(&rsp->gp_wq);
}

@@ -2802,16 +2819,11 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
- __func__, (long)READ_ONCE(rsp->gp_seq),
- (long)READ_ONCE(rnp_root->gp_seq_needed),
- j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->gp_state, rsp->name,
- rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
if (rnp_root != rnp)
raw_spin_unlock_rcu_node(rnp_root);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ show_rcu_gp_kthreads();
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..0e051d9b5f1a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -327,6 +327,8 @@ struct rcu_state {
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ unsigned long gp_wake_time; /* Last GP kthread wake. */
+ unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */

/* End of fields guarded by root rcu_node's lock. */



2018-12-13 00:34:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and don't see sysrq_rcu there either.
> Please correct me if I have gotten something wrong.

That would be because I sent you the wrong patch, apologies! :-/

Please instead see the one below, which does add sysrq_rcu.

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Thursday, December 13, 2018 5:03 AM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
> >
> > Strange.
> >
> > Are the systems responsive to sysrq keys once failure occurs? If so,
> > I will provide you a sysrq-R or some such to dump out the RCU state.
>
> Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.

------------------------------------------------------------------------

commit 04b6245c8458e8725f4169e62912c1fadfdf8141
Author: Paul E. McKenney <[email protected]>
Date: Wed Dec 12 16:10:09 2018 -0800

rcu: Add sysrq rcu_node-dump capability

Backported from v4.21/v5.0

Life is hard if RCU manages to get stuck without triggering RCU CPU
stall warnings or triggering the rcu_check_gp_start_stall() checks
for failing to start a grace period. This commit therefore adds a
boot-time-selectable sysrq key (commandeering "y") that allows manually
dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
must be set for this sysrq to be available.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..e9392a9d6291 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -61,6 +61,7 @@
#include <linux/trace_events.h>
#include <linux/suspend.h>
#include <linux/ftrace.h>
+#include <linux/sysrq.h>

#include "tree.h"
#include "rcu.h"
@@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
/* panic() on RCU Stall sysctl. */
int sysctl_panic_on_rcu_stall __read_mostly;
+/* Commandeer a sysrq key to dump RCU's tree. */
+static bool sysrq_rcu;
+module_param(sysrq_rcu, bool, 0444);

/*
* The rcu_scheduler_active variable is initialized to the value
@@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
}
EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);

+/* Dump grace-period-request information due to commandeered sysrq. */
+static void sysrq_show_rcu(int key)
+{
+ show_rcu_gp_kthreads();
+}
+
+static struct sysrq_key_op sysrq_rcudump_op = {
+ .handler = sysrq_show_rcu,
+ .help_msg = "show-rcu(y)",
+ .action_msg = "Show RCU tree",
+ .enable_mask = SYSRQ_ENABLE_DUMP,
+};
+
+static int __init rcu_sysrq_init(void)
+{
+ if (sysrq_rcu)
+ return register_sysrq_key('y', &sysrq_rcudump_op);
+ return 0;
+}
+early_initcall(rcu_sysrq_init);
+
/*
* Send along grace-period-related data for rcutorture diagnostics.
*/


2018-12-13 02:17:55

by Zhang, Jun

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

Hello, Paul

I think the following patch is better: ULONG_CMP_GE can cause a redundant write, which risks writing back a stale value.
Please help review.
I haven't tested it; if you agree, we will test it.
Thanks!


diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1..c00f34e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
zero_cpu_stall_ticks(rdp);
}
rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
- if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
+ if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
rdp->gp_seq_needed = rnp->gp_seq_needed;
WRITE_ONCE(rdp->gpwrap, false);
rcu_gpnum_ovf(rnp, rdp);
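
For reference, the two comparison macros are defined in kernel/rcu/rcu.h
(v4.19) as copied below, and the only case where the old and new
conditions differ is equality -- exactly the case where the redundant
store, and hence the overwrite window, would otherwise occur. A minimal
userspace sketch:

#include <limits.h>
#include <stdio.h>

/* Wrap-safe unsigned comparisons, as defined in kernel/rcu/rcu.h (v4.19). */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))	/* a >= b, modulo wrap */
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))	/* a < b, modulo wrap */

int main(void)
{
	unsigned long rnp_need = 21808196;	/* rnp->gp_seq_needed */
	unsigned long rdp_need = 21808196;	/* rdp->gp_seq_needed, already equal */

	/* Old condition: true on equality, so the store still executes. */
	printf("ULONG_CMP_GE(rnp, rdp) = %d\n", ULONG_CMP_GE(rnp_need, rdp_need));
	/* New condition: false on equality, so the redundant store is skipped. */
	printf("ULONG_CMP_LT(rdp, rnp) = %d\n", ULONG_CMP_LT(rdp_need, rnp_need));
	return 0;
}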


-----Original Message-----
From: Paul E. McKenney [mailto:[email protected]]
Sent: Thursday, December 13, 2018 08:12
To: He, Bo <[email protected]>
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
Subject: Re: rcu_preempt caused oom

On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and don't see sysrq_rcu there either.
> > Please correct me if I have gotten something wrong.

That would be because I sent you the wrong patch, apologies! :-/

Please instead see the one below, which does add sysrq_rcu.

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Thursday, December 13, 2018 5:03 AM
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Zhang, Jun
> <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> <[email protected]>; Bai, Jie A <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
> >
> > Strange.
> >
> > Are the systems responsive to sysrq keys once failure occurs? If
> > so, I will provide you a sysrq-R or some such to dump out the RCU state.
>
> Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.

------------------------------------------------------------------------

commit 04b6245c8458e8725f4169e62912c1fadfdf8141
Author: Paul E. McKenney <[email protected]>
Date: Wed Dec 12 16:10:09 2018 -0800

rcu: Add sysrq rcu_node-dump capability

Backported from v4.21/v5.0

Life is hard if RCU manages to get stuck without triggering RCU CPU
stall warnings or triggering the rcu_check_gp_start_stall() checks
for failing to start a grace period. This commit therefore adds a
boot-time-selectable sysrq key (commandeering "y") that allows manually
dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
must be set for this sysrq to be available.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..e9392a9d6291 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -61,6 +61,7 @@
#include <linux/trace_events.h>
#include <linux/suspend.h>
#include <linux/ftrace.h>
+#include <linux/sysrq.h>

#include "tree.h"
#include "rcu.h"
@@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
/* panic() on RCU Stall sysctl. */
int sysctl_panic_on_rcu_stall __read_mostly;
+/* Commandeer a sysrq key to dump RCU's tree. */
+static bool sysrq_rcu;
+module_param(sysrq_rcu, bool, 0444);

/*
* The rcu_scheduler_active variable is initialized to the value
@@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
}
EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);

+/* Dump grace-period-request information due to commandeered sysrq. */
+static void sysrq_show_rcu(int key)
+{
+ show_rcu_gp_kthreads();
+}
+
+static struct sysrq_key_op sysrq_rcudump_op = {
+ .handler = sysrq_show_rcu,
+ .help_msg = "show-rcu(y)",
+ .action_msg = "Show RCU tree",
+ .enable_mask = SYSRQ_ENABLE_DUMP,
+};
+
+static int __init rcu_sysrq_init(void)
+{
+ if (sysrq_rcu)
+ return register_sysrq_key('y', &sysrq_rcudump_op);
+ return 0;
+}
+early_initcall(rcu_sysrq_init);
+
/*
* Send along grace-period-related data for rcutorture diagnostics.
*/


2018-12-13 02:44:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> Hello, Paul
>
> I think the following patch is better: ULONG_CMP_GE can cause a redundant write, which risks writing back a stale value.
> Please help review.
> I haven't tested it; if you agree, we will test it.

Just to make sure that I understand, you are worried about something like
the following, correct?

o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
and finds them equal.

o At just this time something like rcu_start_this_gp() assigns a new
(larger) value to rdp->gp_seq_needed.

o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
old value.

This cannot happen because __note_gp_changes() runs with interrupts
disabled on the CPU corresponding to the rcu_data structure referenced
by the rdp pointer. So there is no way for rcu_start_this_gp() to be
invoked on the same CPU during this "if" statement.
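
For reference, here is the v4.19 caller, lightly abridged from my reading
of the source; the local_irq_save() taken before the rcu_node lock is what
pins __note_gp_changes() to the CPU and closes the window:

static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
{
	unsigned long flags;
	bool needwake;
	struct rcu_node *rnp;

	local_irq_save(flags);	/* Irqs off: no preemption, no migration. */
	rnp = rdp->mynode;
	if ((rdp->gp_seq == rcu_seq_current(&rnp->gp_seq) &&
	     !unlikely(READ_ONCE(rdp->gpwrap))) || /* Nothing changed... */
	    !raw_spin_trylock_rcu_node(rnp)) {     /* ...or lock contended. */
		local_irq_restore(flags);
		return;
	}
	needwake = __note_gp_changes(rsp, rnp, rdp); /* Runs with irqs off. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	if (needwake)
		rcu_gp_kthread_wake(rsp);
}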

Of course, there could be bugs. For example:

o __note_gp_changes() might be called on a different CPU than that
corresponding to rdp. You can check this with something like:

WARN_ON_ONCE(rdp->cpu != smp_processor_id());

o The same things could happen with rcu_start_this_gp(), and the
above WARN_ON_ONCE() would work there as well.

o rcutree_prepare_cpu() is a special case, but is irrelevant unless
you are doing CPU-hotplug operations. (It can run on a CPU other
than rdp->cpu, but only at times when rdp->cpu is offline.)

o Interrupts might not really be disabled.

That said, your patch could reduce overhead slightly, given that the
two values will be equal much of the time. So it might be worth testing
just for that reason.

So why not just test it anyway? If it makes the bug go away, I will be
surprised, but it would not be the first surprise for me. ;-)

Thanx, Paul

> Thanks!
>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0b760c1..c00f34e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> zero_cpu_stall_ticks(rdp);
> }
> rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> rdp->gp_seq_needed = rnp->gp_seq_needed;
> WRITE_ONCE(rdp->gpwrap, false);
> rcu_gpnum_ovf(rnp, rdp);
>
>
> -----Original Message-----
> From: Paul E. McKenney [mailto:[email protected]]
> Sent: Thursday, December 13, 2018 08:12
> To: He, Bo <[email protected]>
> Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and don't see sysrq_rcu there either.
> > Please correct me if I have gotten something wrong.
>
> That would be because I sent you the wrong patch, apologies! :-/
>
> Please instead see the one below, which does add sysrq_rcu.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Thursday, December 13, 2018 5:03 AM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>; Bai, Jie A <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
> > >
> > > Strange.
> > >
> > > Are the systems responsive to sysrq keys once failure occurs? If
> > > so, I will provide you a sysrq-R or some such to dump out the RCU state.
> >
> > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
>
> ------------------------------------------------------------------------
>
> commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Dec 12 16:10:09 2018 -0800
>
> rcu: Add sysrq rcu_node-dump capability
>
> Backported from v4.21/v5.0
>
> Life is hard if RCU manages to get stuck without triggering RCU CPU
> stall warnings or triggering the rcu_check_gp_start_stall() checks
> for failing to start a grace period. This commit therefore adds a
> boot-time-selectable sysrq key (commandeering "y") that allows manually
> dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> must be set for this sysrq to be available.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0b760c1369f7..e9392a9d6291 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -61,6 +61,7 @@
> #include <linux/trace_events.h>
> #include <linux/suspend.h>
> #include <linux/ftrace.h>
> +#include <linux/sysrq.h>
>
> #include "tree.h"
> #include "rcu.h"
> @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> /* panic() on RCU Stall sysctl. */
> int sysctl_panic_on_rcu_stall __read_mostly;
> +/* Commandeer a sysrq key to dump RCU's tree. */
> +static bool sysrq_rcu;
> +module_param(sysrq_rcu, bool, 0444);
>
> /*
> * The rcu_scheduler_active variable is initialized to the value
> @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> }
> EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
>
> +/* Dump grace-period-request information due to commandeered sysrq. */
> +static void sysrq_show_rcu(int key)
> +{
> + show_rcu_gp_kthreads();
> +}
> +
> +static struct sysrq_key_op sysrq_rcudump_op = {
> + .handler = sysrq_show_rcu,
> + .help_msg = "show-rcu(y)",
> + .action_msg = "Show RCU tree",
> + .enable_mask = SYSRQ_ENABLE_DUMP,
> +};
> +
> +static int __init rcu_sysrq_init(void)
> +{
> + if (sysrq_rcu)
> + return register_sysrq_key('y', &sysrq_rcudump_op);
> + return 0;
> +}
> +early_initcall(rcu_sysrq_init);
> +
> /*
> * Send along grace-period-related data for rcutorture diagnostics.
> */
>


2018-12-13 04:42:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> Ok, we will test it, thanks!

But please also try the sysrq-y with the earlier patch after a hang!

Thanx, Paul

> -----Original Message-----
> From: Paul E. McKenney [mailto:[email protected]]
> Sent: Thursday, December 13, 2018 10:43
> To: Zhang, Jun <[email protected]>
> Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > Hello, Paul
> >
> > > I think the following patch is better: ULONG_CMP_GE can cause a redundant write, which risks writing back a stale value.
> > > Please help review.
> > > I haven't tested it; if you agree, we will test it.
>
> Just to make sure that I understand, you are worried about something like the following, correct?
>
> o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> and finds them equal.
>
> o At just this time something like rcu_start_this_gp() assigns a new
> (larger) value to rdp->gp_seq_needed.
>
> o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> old value.
>
> This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
>
> Of course, there could be bugs. For example:
>
> o __note_gp_changes() might be called on a different CPU than that
> corresponding to rdp. You can check this with something like:
>
> WARN_ON_ONCE(rdp->cpu != smp_processor_id());
>
> o The same things could happen with rcu_start_this_gp(), and the
> above WARN_ON_ONCE() would work there as well.
>
> o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> you are doing CPU-hotplug operations. (It can run on a CPU other
> than rdp->cpu, but only at times when rdp->cpu is offline.)
>
> o Interrupts might not really be disabled.
>
> That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
>
> So why not just test it anyway? If it makes the bug go away, I will be surprised, but it would not be the first surprise for me. ;-)
>
> Thanx, Paul
>
> > Thanks!
> >
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 0b760c1..c00f34e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > zero_cpu_stall_ticks(rdp);
> > }
> > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > rdp->gp_seq_needed = rnp->gp_seq_needed;
> > WRITE_ONCE(rdp->gpwrap, false);
> > rcu_gpnum_ovf(rnp, rdp);
> >
> >
> > -----Original Message-----
> > From: Paul E. McKenney [mailto:[email protected]]
> > Sent: Thursday, December 13, 2018 08:12
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Zhang, Jun
> > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J
> > <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and don't see sysrq_rcu there either.
> > > Please correct me if I have gotten something wrong.
> >
> > That would be because I sent you the wrong patch, apologies! :-/
> >
> > Please instead see the one below, which does add sysrq_rcu.
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Thursday, December 13, 2018 5:03 AM
> > > To: He, Bo <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Zhang, Jun
> > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > <[email protected]>; Bai, Jie A <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > we reproduced it on two boards, but I still don't see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > I double-checked that CONFIG_PROVE_RCU=y is enabled in the config extracted from /proc/config.gz.
> > > >
> > > > Strange.
> > > >
> > > > Are the systems responsive to sysrq keys once failure occurs? If
> > > > so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > >
> > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> >
> > ------------------------------------------------------------------------
> >
> > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > Author: Paul E. McKenney <[email protected]>
> > Date: Wed Dec 12 16:10:09 2018 -0800
> >
> > rcu: Add sysrq rcu_node-dump capability
> >
> > Backported from v4.21/v5.0
> >
> > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > for failing to start a grace period. This commit therefore adds a
> > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > must be set for this sysrq to be available.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 0b760c1369f7..e9392a9d6291 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -61,6 +61,7 @@
> > #include <linux/trace_events.h>
> > #include <linux/suspend.h>
> > #include <linux/ftrace.h>
> > +#include <linux/sysrq.h>
> >
> > #include "tree.h"
> > #include "rcu.h"
> > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > /* panic() on RCU Stall sysctl. */
> > int sysctl_panic_on_rcu_stall __read_mostly;
> > +/* Commandeer a sysrq key to dump RCU's tree. */
> > +static bool sysrq_rcu;
> > +module_param(sysrq_rcu, bool, 0444);
> >
> > /*
> > * The rcu_scheduler_active variable is initialized to the value
> > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > }
> > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> >
> > +/* Dump grace-period-request information due to commandeered sysrq. */
> > +static void sysrq_show_rcu(int key)
> > +{
> > + show_rcu_gp_kthreads();
> > +}
> > +
> > +static struct sysrq_key_op sysrq_rcudump_op = {
> > + .handler = sysrq_show_rcu,
> > + .help_msg = "show-rcu(y)",
> > + .action_msg = "Show RCU tree",
> > + .enable_mask = SYSRQ_ENABLE_DUMP,
> > +};
> > +
> > +static int __init rcu_sysrq_init(void)
> > +{
> > + if (sysrq_rcu)
> > + return register_sysrq_key('y', &sysrq_rcudump_op);
> > + return 0;
> > +}
> > +early_initcall(rcu_sysrq_init);
> > +
> > /*
> > * Send along grace-period-related data for rcutorture diagnostics.
> > */
> >
>


2018-12-13 18:28:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> one of the boards reproduced the issue with show_rcu_gp_kthreads() applied; I also enclosed the logs as an attachment.
>
> [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1

This is quite helpful, thank you!

The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y,
which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task
is waiting for a new grace-period request. The "->state: 0x402" means
that it is sleeping, neither running nor in the process of waking up.
The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time
308258" means that it has been more than 300,000 jiffies since the
rcu_preempt task did anything or was requested to do anything.
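
In case it helps with decoding: a sketch of the relevant constants, taken
from my reading of the v4.19 sources (include/linux/sched.h); treat the
exact values as version-dependent:

/* ->state bits (include/linux/sched.h, v4.19): */
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_NOLOAD		0x0400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/*
 * ->state 0x402 == TASK_IDLE: an uninterruptible sleep that is excluded
 * from the load average, that is, the kthread is parked in its wait loop.
 * The "(1)" in RCU_GP_WAIT_GPS(1) is rsp->gp_state, printed by way of the
 * gp_state_names[] lookup that the debug patch's gp_state_getname() does.
 */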

The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt
to awaken the rcu_preempt task happened during the last grace period.
The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that
someone requested a new grace period. So if the rcu_preempt task were
to wake up, it would process the new grace period. Note again also
the ->gp_req_activity 308256, which indicates that ->gp_flags was set
more than 300,000 jiffies ago, just after the last recorded activity
of the rcu_preempt task.
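
The arithmetic behind that reading: a ->gp_seq-style counter carries the
grace-period number in its upper bits and a phase in its low two bits.
A sketch with the helpers as defined in kernel/rcu/rcu.h (v4.19):

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

static inline unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;	/* Grace-period number. */
}

static inline unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;	/* 0 = idle, nonzero = GP in flight. */
}

/*
 * ->gp_wake_seq   21808189 -> ctr 5452047, state 1: wakeup sent during that GP.
 * ->gp_seq        21808192 -> ctr 5452048, state 0: that GP has completed.
 * ->gp_seq_needed 21808196 -> ctr 5452049, state 0: one further GP requested.
 */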

But this is exactly the situation that rcu_check_gp_start_stall() is
designed to warn about (and does warn about for me when I comment
out the wakeup code). So why is rcu_check_gp_start_stall() not being
called? Here are a couple of possibilities:

1. Because rcu_check_gp_start_stall() is only ever invoked from
RCU_SOFTIRQ, it is possible that softirqs are stalled for
whatever reason.

2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
interrupt handler, it is possible that the scheduler tick has
somehow been disabled. Traces from earlier runs showed a great
deal of RCU callbacks queued, which would have caused RCU to
refuse to allow the scheduler tick to be disabled, even if the
corresponding CPU was idle.

3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
that you are building for a battery-powered device) and all of the
CPU's callbacks are lazy. Except that your earlier traces showed
lots of non-lazy callbacks. Besides, even if all callbacks were
lazy, there would still be a scheduling-clock interrupt every
six seconds, and there are quite a few six-second intervals
in a two-minute watchdog timeout.

But if we cannot find the problem quickly, I will likely ask
you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
be thought of as bisecting the RCU code looking for the bug.

The first two of these seem unlikely given that the watchdog timer was
still firing. Still, I don't see how 300,000 jiffies elapsed with a grace
period requested and not started otherwise. Could you please check?
One way to do so would be to enable ftrace on rcu_check_callbacks(),
__rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might
be necessary to mark rcu_check_gp_start_stall() noinline. You might have
better ways to collect this information.

Without this information, the only workaround patch I can give you will
degrade battery lifetime, which might not be what you want.

You do have a lockdep complaint early at boot. Although I don't
immediately see how this self-deadlock would affect RCU, please do get
it fixed. Sometimes the consequences of this sort of deadlock can
propagate to unexpected places.

Regardless of why rcu_check_gp_start_stall() failed to complain, it looks
like this was set after the rcu_preempt task slept for the last time,
and so there should have been a wakeup the last time that ->gp_flags
was set. Perhaps there is some code path that drops the wakeup.
I did check this in current -rcu, but you are instead running v4.19,
so I should also check there.

The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp()
and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from
consideration because only the rcu_preempt task will execute that code,
and we know that this task was asleep at the last time this bit was set.
Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup
is needed, and the caller must do the wakeup once it is safe to do so,
that is, after the various rcu_node locks have been released (doing a
wakeup while holding any of those locks results in deadlock).

The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs()
and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you
are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs()
is invoked from:

o rcu_accelerate_cbs_unlocked(), which does the following, thus
properly awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_advance_cbs(), which returns the value returned by
rcu_accelerate_cbs(), thus pushing the problem off to its
callers, which are called out below.

o __note_gp_changes(), which also returns the value returned by
rcu_accelerate_cbs(), thus pushing the problem off to its callers,
which are called out below.

o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
kthreads such as the rcu_preempt task. Therefore, this function
never needs to awaken the rcu_preempt task, because the fact
that this function is executing means that this task is already
awake. (Also, as noted above, we can eliminate this code from
consideration because this task is known to have been sleeping
at the last time that the RCU_GP_FLAG_INIT bit was set.)

o rcu_report_qs_rdp(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);

rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
/* ^^^ Released rnp->lock */
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_prepare_for_idle(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
if (needwake)
rcu_gp_kthread_wake(rsp);

Now for rcu_advance_cbs():

o __note_gp_changes(), which also returns the value returned
by rcu_advance_cbs(), thus pushing the problem off to its callers,
which are called out below.

o rcu_migrate_callbacks(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
rcu_advance_cbs(rsp, rnp_root, my_rdp);
rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
!rcu_segcblist_n_cbs(&my_rdp->cblist));
raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
if (needwake)
rcu_gp_kthread_wake(rsp);

Now for __note_gp_changes():

o note_gp_changes(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = __note_gp_changes(rsp, rnp, rdp);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_gp_init() which is only ever invoked by RCU grace-period
kthreads such as the rcu_preempt task, which makes wakeups
unnecessary, just as for rcu_gp_cleanup() above.

o rcu_gp_cleanup(), ditto.

So I am not seeing how I am losing a wakeup, but please do feel free to
double-check my analysis. One way to do that is using event tracing.

Thanx, Paul

------------------------------------------------------------------------
lockdep complaint:
------------------------------------------------------------------------

[ 2.895507] ======================================================
[ 2.895511] WARNING: possible circular locking dependency detected
[ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
[ 2.895521] ------------------------------------------------------
[ 2.895525] earlyEvs/1839 is trying to acquire lock:
[ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895546]
[ 2.895546] but task is already holding lock:
[ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[ 2.895561]
[ 2.895561] which lock already depends on the new lock.
[ 2.895561]
[ 2.895566]
[ 2.895566] the existing dependency chain (in reverse order) is:
[ 2.895570]
[ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
[ 2.895583] __mutex_lock+0x80/0x9a0
[ 2.895588] mutex_lock_nested+0x1b/0x20
[ 2.895593] media_device_register_entity+0x92/0x1e0
[ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
[ 2.895604] ipu_isys_csi2_init+0x22c/0x520
[ 2.895608] isys_probe+0x6cb/0xed0
[ 2.895613] ipu_bus_probe+0xfd/0x2e0
[ 2.895620] really_probe+0x268/0x3d0
[ 2.895625] driver_probe_device+0x11a/0x130
[ 2.895630] __device_attach_driver+0x86/0x100
[ 2.895635] bus_for_each_drv+0x6e/0xb0
[ 2.895640] __device_attach+0xdf/0x160
[ 2.895645] device_initial_probe+0x13/0x20
[ 2.895650] bus_probe_device+0xa6/0xc0
[ 2.895655] deferred_probe_work_func+0x88/0xe0
[ 2.895661] process_one_work+0x220/0x5c0
[ 2.895665] worker_thread+0x1da/0x3b0
[ 2.895670] kthread+0x12c/0x150
[ 2.895675] ret_from_fork+0x3a/0x50
[ 2.895678]
[ 2.895678] -> #0 (&asd->mutex){+.+.}:
[ 2.895688] lock_acquire+0x95/0x1a0
[ 2.895693] __mutex_lock+0x80/0x9a0
[ 2.895698] mutex_lock_nested+0x1b/0x20
[ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
[ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
[ 2.895723] v4l2_subdev_link_validate+0x246/0x490
[ 2.895728] csi2_link_validate+0xc6/0x220
[ 2.895733] __media_pipeline_start+0x15b/0x2f0
[ 2.895738] media_pipeline_start+0x33/0x50
[ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
[ 2.895748] start_streaming+0x186/0x3a0
[ 2.895753] vb2_start_streaming+0x6d/0x130
[ 2.895758] vb2_core_streamon+0x108/0x140
[ 2.895762] vb2_streamon+0x29/0x50
[ 2.895767] vb2_ioctl_streamon+0x42/0x50
[ 2.895772] v4l_streamon+0x20/0x30
[ 2.895776] __video_do_ioctl+0x1af/0x3c0
[ 2.895781] video_usercopy+0x27e/0x7e0
[ 2.895785] video_ioctl2+0x15/0x20
[ 2.895789] v4l2_ioctl+0x49/0x50
[ 2.895794] do_video_ioctl+0x93c/0x2360
[ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
[ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
[ 2.895813] do_fast_syscall_32+0x9a/0x2d6
[ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
[ 2.895821]
[ 2.895821] other info that might help us debug this:
[ 2.895821]
[ 2.895826] Possible unsafe locking scenario:
[ 2.895826]
[ 2.895830]        CPU0                    CPU1
[ 2.895833]        ----                    ----
[ 2.895836]   lock(&mdev->graph_mutex);
[ 2.895842]                                lock(&asd->mutex);
[ 2.895847]                                lock(&mdev->graph_mutex);
[ 2.895852]   lock(&asd->mutex);
[ 2.895857]
[ 2.895857] *** DEADLOCK ***
[ 2.895857]
[ 2.895863] 3 locks held by earlyEvs/1839:
[ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
[ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
[ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[ 2.895896]
[ 2.895896] stack backtrace:
[ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
[ 2.895907] Call Trace:
[ 2.895915] dump_stack+0x70/0xa5
[ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
[ 2.895927] __lock_acquire+0x1284/0x1340
[ 2.895931] ? __lock_acquire+0x2b5/0x1340
[ 2.895940] lock_acquire+0x95/0x1a0
[ 2.895945] ? lock_acquire+0x95/0x1a0
[ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895961] __mutex_lock+0x80/0x9a0
[ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895971] ? crlmodule_get_format+0x43/0x50
[ 2.895979] mutex_lock_nested+0x1b/0x20
[ 2.895984] ? mutex_lock_nested+0x1b/0x20
[ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
[ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
[ 2.896011] ? crlmodule_get_format+0x2a/0x50
[ 2.896018] ? find_held_lock+0x35/0xa0
[ 2.896023] ? crlmodule_get_format+0x43/0x50
[ 2.896030] v4l2_subdev_link_validate+0x246/0x490
[ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
[ 2.896042] ? mutex_unlock+0x12/0x20
[ 2.896046] ? crlmodule_get_format+0x43/0x50
[ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
[ 2.896065] ? __is_insn_slot_addr+0xad/0x120
[ 2.896070] ? kernel_text_address+0xc4/0x100
[ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
[ 2.896085] ? kernel_text_address+0xc4/0x100
[ 2.896092] ? __lock_acquire+0x1106/0x1340
[ 2.896096] ? __lock_acquire+0x1169/0x1340
[ 2.896103] csi2_link_validate+0xc6/0x220
[ 2.896110] ? __lock_is_held+0x5a/0xa0
[ 2.896115] ? mark_held_locks+0x58/0x80
[ 2.896122] ? __kmalloc+0x207/0x2e0
[ 2.896127] ? __lock_is_held+0x5a/0xa0
[ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
[ 2.896139] ? __kmalloc+0x2a3/0x2e0
[ 2.896144] ? media_pipeline_start+0x28/0x50
[ 2.896150] ? __media_entity_enum_init+0x33/0x70
[ 2.896155] ? csi2_has_route+0x18/0x20
[ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
[ 2.896166] __media_pipeline_start+0x15b/0x2f0
[ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
[ 2.896179] media_pipeline_start+0x33/0x50
[ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
[ 2.896191] ? __lock_acquire+0x132e/0x1340
[ 2.896198] ? __lock_acquire+0x2b5/0x1340
[ 2.896204] ? lock_acquire+0x95/0x1a0
[ 2.896209] ? start_streaming+0x5c/0x3a0
[ 2.896215] ? start_streaming+0x5c/0x3a0
[ 2.896221] ? __mutex_lock+0x391/0x9a0
[ 2.896226] ? v4l_enable_media_source+0x2d/0x70
[ 2.896233] ? find_held_lock+0x35/0xa0
[ 2.896238] ? v4l_enable_media_source+0x57/0x70
[ 2.896245] start_streaming+0x186/0x3a0
[ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
[ 2.896257] vb2_start_streaming+0x6d/0x130
[ 2.896262] ? vb2_start_streaming+0x6d/0x130
[ 2.896267] vb2_core_streamon+0x108/0x140
[ 2.896273] vb2_streamon+0x29/0x50
[ 2.896278] vb2_ioctl_streamon+0x42/0x50
[ 2.896284] v4l_streamon+0x20/0x30
[ 2.896288] __video_do_ioctl+0x1af/0x3c0
[ 2.896296] ? __might_fault+0x85/0x90
[ 2.896302] video_usercopy+0x27e/0x7e0
[ 2.896307] ? copy_overflow+0x20/0x20
[ 2.896313] ? find_held_lock+0x35/0xa0
[ 2.896319] ? __might_fault+0x3e/0x90
[ 2.896325] video_ioctl2+0x15/0x20
[ 2.896330] v4l2_ioctl+0x49/0x50
[ 2.896335] do_video_ioctl+0x93c/0x2360
[ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
[ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
[ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
[ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
[ 2.896364] do_fast_syscall_32+0x9a/0x2d6
[ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
[ 2.896377] RIP: 0023:0xf7e79b79
[ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
[ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
[ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
[ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
[ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
[ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

------------------------------------------------------------------------

> [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
>
> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Thursday, December 13, 2018 12:40 PM
> To: Zhang, Jun <[email protected]>
> Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > Ok, we will test it, thanks!
>
> But please also try the sysrq-y with the earlier patch after a hang!
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney [mailto:[email protected]]
> > Sent: Thursday, December 13, 2018 10:43
> > To: Zhang, Jun <[email protected]>
> > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie
> > A <[email protected]>; Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > Hello, Paul
> > >
> > > I think the next patch is better.
> > > Because ULONG_CMP_GE could cause a double write, which risks writing back an old value.
> > > Please help review.
> > > I have not tested it. If you agree, we will test it.
> >
> > Just to make sure that I understand, you are worried about something like the following, correct?
> >
> > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > and finds them equal.
> >
> > o At just this time something like rcu_start_this_gp() assigns a new
> > (larger) value to rdp->gp_seq_needed.
> >
> > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > old value.
> >
> > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
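> >
> > Concretely, in v4.19 the note_gp_changes() path brackets that call
> > roughly like this (a sketch from memory, not the verbatim source):
> >
> >	local_irq_save(flags);
> >	rnp = rdp->mynode;
> >	if ((rdp->gp_seq == rcu_seq_current(&rnp->gp_seq) &&
> >	     !unlikely(READ_ONCE(rdp->gpwrap))) ||
> >	    !raw_spin_trylock_rcu_node(rnp)) { /* irqs already off, so later. */
> >		local_irq_restore(flags);
> >		return;
> >	}
> >	needwake = __note_gp_changes(rsp, rnp, rdp);
> >	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);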
> >
> > Of course, there could be bugs. For example:
> >
> > o __note_gp_changes() might be called on a different CPU than that
> > corresponding to rdp. You can check this with something like:
> >
> > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> >
> > o The same things could happen with rcu_start_this_gp(), and the
> > above WARN_ON_ONCE() would work there as well.
> >
> > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > you are doing CPU-hotplug operations. (It can run on a CPU other
> > than rdp->cpu, but only at times when rdp->cpu is offline.)
> >
> > o Interrupts might not really be disabled.
> >
> > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> >
> > So why not just test it anyway? If it makes the bug go away, I will
> > be surprised, but it would not be the first surprise for me. ;-)
> >
> > Thanx, Paul
> >
> > > Thanks!
> > >
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1..c00f34e 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > zero_cpu_stall_ticks(rdp);
> > > }
> > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > WRITE_ONCE(rdp->gpwrap, false);
> > > rcu_gpnum_ovf(rnp, rdp);
> > >
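> > > For reference, the two comparison macros are defined in kernel/rcu/rcu.h
> > > along these lines (paraphrased from v4.19; exact formatting may differ):
> > >
> > >	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
> > >	#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
> > >
> > > Both are wraparound-safe, but ULONG_CMP_GE() is also true when the two
> > > values are equal, so the old condition rewrites rdp->gp_seq_needed even
> > > when nothing has changed, while ULONG_CMP_LT() stores only when rnp's
> > > value is strictly newer.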
> > >
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:[email protected]]
> > > Sent: Thursday, December 13, 2018 08:12
> > > To: He, Bo <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Zhang, Jun
> > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi
> > > J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel. I also checked the latest kernel and the latest tag v4.20-rc6, and do not see sysrq_rcu there either.
> > > > Please correct me if I have something wrong.
> > >
> > > That would be because I sent you the wrong patch, apologies! :-/
> > >
> > > Please instead see the one below, which does add sysrq_rcu.
> > >
> > > Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang, Jun
> > > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang,
> > > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > we reproduced on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.
> > > > >
> > > > > Strange.
> > > > >
> > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > >
> > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Wed Dec 12 16:10:09 2018 -0800
> > >
> > > rcu: Add sysrq rcu_node-dump capability
> > >
> > > Backported from v4.21/v5.0
> > >
> > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > for failing to start a grace period. This commit therefore adds a
> > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > must be set for this sysrq to be available.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1369f7..e9392a9d6291 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -61,6 +61,7 @@
> > > #include <linux/trace_events.h>
> > > #include <linux/suspend.h>
> > > #include <linux/ftrace.h>
> > > +#include <linux/sysrq.h>
> > >
> > > #include "tree.h"
> > > #include "rcu.h"
> > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > /* panic() on RCU Stall sysctl. */
> > > int sysctl_panic_on_rcu_stall __read_mostly;
> > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > +static bool sysrq_rcu;
> > > +module_param(sysrq_rcu, bool, 0444);
> > >
> > > /*
> > > * The rcu_scheduler_active variable is initialized to the value
> > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > }
> > > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > >
> > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > +static void sysrq_show_rcu(int key)
> > > +{
> > > +	show_rcu_gp_kthreads();
> > > +}
> > > +
> > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > +	.handler = sysrq_show_rcu,
> > > +	.help_msg = "show-rcu(y)",
> > > +	.action_msg = "Show RCU tree",
> > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > +};
> > > +
> > > +static int __init rcu_sysrq_init(void)
> > > +{
> > > +	if (sysrq_rcu)
> > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > +	return 0;
> > > +}
> > > +early_initcall(rcu_sysrq_init);
> > > +
> > > /*
> > > * Send along grace-period-related data for rcutorture diagnostics.
> > > */
> > >
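> > > To use it: boot with rcutree.sysrq_rcu=1, then once the hang occurs,
> > > trigger the dump with sysrq-y, for example "echo y > /proc/sysrq-trigger"
> > > if a shell is still responsive.
> > >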
> >
>



2018-12-14 01:31:36

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked, and there is no FAST_NO_HZ in .config:

Here is the grep from .config:
egrep "HZ|RCU" .config
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_MACHZ_WDT is not set
# RCU Debugging
CONFIG_PROVE_RCU=y
CONFIG_RCU_PERF_TEST=m
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=7
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Friday, December 14, 2018 2:12 AM
To: He, Bo <[email protected]>
Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> one of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
>
> [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1

This is quite helpful, thank you!

The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.

The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
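
To decode those numbers (assuming the stock v4.19 definitions): ->state 0x402
is TASK_UNINTERRUPTIBLE | TASK_NOLOAD, that is, TASK_IDLE. And the low two
bits of a gp_seq value are grace-period state with the rest a counter, so
->gp_seq 21808192 >> 2 = 5452048 with state 0 (idle), while ->gp_seq_needed
21808196 >> 2 = 5452049, a request for exactly the next grace period.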

But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:

1. Because rcu_check_gp_start_stall() is only ever invoked from
RCU_SOFTIRQ, it is possible that softirqs are stalled for
whatever reason.

2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
interrupt handler, it is possible that the scheduler tick has
somehow been disabled. Traces from earlier runs showed a great
deal of RCU callbacks queued, which would have caused RCU to
refuse to allow the scheduler tick to be disabled, even if the
corresponding CPU was idle.

3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
that you are building for a battery-powered device) and all of the
CPU's callbacks are lazy. Except that your earlier traces showed
lots of non-lazy callbacks. Besides, even if all callbacks were
lazy, there would still be a scheduling-clock interrupt every
six seconds, and there are quite a few six-second intervals
in a two-minute watchdog timeout.

But if we cannot find the problem quickly, I will likely ask
you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
be thought of as bisecting the RCU code looking for the bug.

The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to mark rcu_check_gp_start_stall() noinline, since an inlined static function is invisible to the function tracer. You might have better ways to collect this information.
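
For example, via the stock tracefs interface (this assumes debugfs is
mounted at /sys/kernel/debug; the log destination is only an illustration):

	cd /sys/kernel/debug/tracing
	echo rcu_check_callbacks __rcu_process_callbacks rcu_check_gp_start_stall > set_ftrace_filter
	echo function > current_tracer
	cat trace_pipe > /data/rcu-ftrace.log &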

Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.

You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.

Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
I did check this in current -rcu, but you are instead running v4.19, so I should also check there.

The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).

The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:

o rcu_accelerate_cbs_unlocked(), which does the following, thus
properly awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_advance_cbs(), which returns the value returned by
rcu_accelerate_cbs(), thus pushing the problem off to its
callers, which are called out below.

o __note_gp_changes(), which also returns the value returned by
rcu_accelerate_cbs(), thus pushing the problem off to its callers,
which are called out below.

o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
kthreads such as the rcu_preempt task. Therefore, this function
never needs to awaken the rcu_preempt task, because the fact
that this function is executing means that this task is already
awake. (Also, as noted above, we can eliminate this code from
consideration because this task is known to have been sleeping
at the last time that the RCU_GP_FLAG_INIT bit was set.)

o rcu_report_qs_rdp(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);

rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
/* ^^^ Released rnp->lock */
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_prepare_for_idle(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
if (needwake)
rcu_gp_kthread_wake(rsp);

Now for rcu_advance_cbs():

o __note_gp_changes(), which also returns the value returned
by rcu_advance_cbs(), thus pushing the problem off to its callers,
which are called out below.

o rcu_migrate_callbacks(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
rcu_advance_cbs(rsp, rnp_root, my_rdp);
rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
!rcu_segcblist_n_cbs(&my_rdp->cblist));
raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
if (needwake)
rcu_gp_kthread_wake(rsp);

Now for __note_gp_changes():

o note_gp_changes(), which does the following, thus properly
awakening the rcu_preempt task when needed:

needwake = __note_gp_changes(rsp, rnp, rdp);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (needwake)
rcu_gp_kthread_wake(rsp);

o rcu_gp_init() which is only ever invoked by RCU grace-period
kthreads such as the rcu_preempt task, which makes wakeups
unnecessary, just as for rcu_gp_cleanup() above.

o rcu_gp_cleanup(), ditto.

So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
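
For example, again via the stock tracefs interface (assuming a kernel with
CONFIG_RCU_TRACE=y, so the rcu trace events are available):

	echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
	cat /sys/kernel/debug/tracing/trace_pipe

The rcu_grace_period and rcu_future_grace_period events record each
grace-period request and state change, so a request that never gets a
matching wakeup should stand out.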

Thanx, Paul

------------------------------------------------------------------------
lockdep complaint:
------------------------------------------------------------------------

[ 2.895507] ======================================================
[ 2.895511] WARNING: possible circular locking dependency detected
[ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
[ 2.895521] ------------------------------------------------------
[ 2.895525] earlyEvs/1839 is trying to acquire lock:
[ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895546]
[ 2.895546] but task is already holding lock:
[ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[ 2.895561]
[ 2.895561] which lock already depends on the new lock.
[ 2.895561]
[ 2.895566]
[ 2.895566] the existing dependency chain (in reverse order) is:
[ 2.895570]
[ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
[ 2.895583] __mutex_lock+0x80/0x9a0
[ 2.895588] mutex_lock_nested+0x1b/0x20
[ 2.895593] media_device_register_entity+0x92/0x1e0
[ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
[ 2.895604] ipu_isys_csi2_init+0x22c/0x520
[ 2.895608] isys_probe+0x6cb/0xed0
[ 2.895613] ipu_bus_probe+0xfd/0x2e0
[ 2.895620] really_probe+0x268/0x3d0
[ 2.895625] driver_probe_device+0x11a/0x130
[ 2.895630] __device_attach_driver+0x86/0x100
[ 2.895635] bus_for_each_drv+0x6e/0xb0
[ 2.895640] __device_attach+0xdf/0x160
[ 2.895645] device_initial_probe+0x13/0x20
[ 2.895650] bus_probe_device+0xa6/0xc0
[ 2.895655] deferred_probe_work_func+0x88/0xe0
[ 2.895661] process_one_work+0x220/0x5c0
[ 2.895665] worker_thread+0x1da/0x3b0
[ 2.895670] kthread+0x12c/0x150
[ 2.895675] ret_from_fork+0x3a/0x50
[ 2.895678]
[ 2.895678] -> #0 (&asd->mutex){+.+.}:
[ 2.895688] lock_acquire+0x95/0x1a0
[ 2.895693] __mutex_lock+0x80/0x9a0
[ 2.895698] mutex_lock_nested+0x1b/0x20
[ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
[ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
[ 2.895723] v4l2_subdev_link_validate+0x246/0x490
[ 2.895728] csi2_link_validate+0xc6/0x220
[ 2.895733] __media_pipeline_start+0x15b/0x2f0
[ 2.895738] media_pipeline_start+0x33/0x50
[ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
[ 2.895748] start_streaming+0x186/0x3a0
[ 2.895753] vb2_start_streaming+0x6d/0x130
[ 2.895758] vb2_core_streamon+0x108/0x140
[ 2.895762] vb2_streamon+0x29/0x50
[ 2.895767] vb2_ioctl_streamon+0x42/0x50
[ 2.895772] v4l_streamon+0x20/0x30
[ 2.895776] __video_do_ioctl+0x1af/0x3c0
[ 2.895781] video_usercopy+0x27e/0x7e0
[ 2.895785] video_ioctl2+0x15/0x20
[ 2.895789] v4l2_ioctl+0x49/0x50
[ 2.895794] do_video_ioctl+0x93c/0x2360
[ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
[ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
[ 2.895813] do_fast_syscall_32+0x9a/0x2d6
[ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
[ 2.895821]
[ 2.895821] other info that might help us debug this:
[ 2.895821]
[ 2.895826] Possible unsafe locking scenario:
[ 2.895826]
[ 2.895830]        CPU0                    CPU1
[ 2.895833]        ----                    ----
[ 2.895836]   lock(&mdev->graph_mutex);
[ 2.895842]                                lock(&asd->mutex);
[ 2.895847]                                lock(&mdev->graph_mutex);
[ 2.895852]   lock(&asd->mutex);
[ 2.895857]
[ 2.895857] *** DEADLOCK ***
[ 2.895857]
[ 2.895863] 3 locks held by earlyEvs/1839:
[ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
[ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
[ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[ 2.895896]
[ 2.895896] stack backtrace:
[ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
[ 2.895907] Call Trace:
[ 2.895915] dump_stack+0x70/0xa5
[ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
[ 2.895927] __lock_acquire+0x1284/0x1340
[ 2.895931] ? __lock_acquire+0x2b5/0x1340
[ 2.895940] lock_acquire+0x95/0x1a0
[ 2.895945] ? lock_acquire+0x95/0x1a0
[ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895961] __mutex_lock+0x80/0x9a0
[ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895971] ? crlmodule_get_format+0x43/0x50
[ 2.895979] mutex_lock_nested+0x1b/0x20
[ 2.895984] ? mutex_lock_nested+0x1b/0x20
[ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
[ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
[ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
[ 2.896011] ? crlmodule_get_format+0x2a/0x50
[ 2.896018] ? find_held_lock+0x35/0xa0
[ 2.896023] ? crlmodule_get_format+0x43/0x50
[ 2.896030] v4l2_subdev_link_validate+0x246/0x490
[ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
[ 2.896042] ? mutex_unlock+0x12/0x20
[ 2.896046] ? crlmodule_get_format+0x43/0x50
[ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
[ 2.896065] ? __is_insn_slot_addr+0xad/0x120
[ 2.896070] ? kernel_text_address+0xc4/0x100
[ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
[ 2.896085] ? kernel_text_address+0xc4/0x100
[ 2.896092] ? __lock_acquire+0x1106/0x1340
[ 2.896096] ? __lock_acquire+0x1169/0x1340
[ 2.896103] csi2_link_validate+0xc6/0x220
[ 2.896110] ? __lock_is_held+0x5a/0xa0
[ 2.896115] ? mark_held_locks+0x58/0x80
[ 2.896122] ? __kmalloc+0x207/0x2e0
[ 2.896127] ? __lock_is_held+0x5a/0xa0
[ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
[ 2.896139] ? __kmalloc+0x2a3/0x2e0
[ 2.896144] ? media_pipeline_start+0x28/0x50
[ 2.896150] ? __media_entity_enum_init+0x33/0x70
[ 2.896155] ? csi2_has_route+0x18/0x20
[ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
[ 2.896166] __media_pipeline_start+0x15b/0x2f0
[ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
[ 2.896179] media_pipeline_start+0x33/0x50
[ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
[ 2.896191] ? __lock_acquire+0x132e/0x1340
[ 2.896198] ? __lock_acquire+0x2b5/0x1340
[ 2.896204] ? lock_acquire+0x95/0x1a0
[ 2.896209] ? start_streaming+0x5c/0x3a0
[ 2.896215] ? start_streaming+0x5c/0x3a0
[ 2.896221] ? __mutex_lock+0x391/0x9a0
[ 2.896226] ? v4l_enable_media_source+0x2d/0x70
[ 2.896233] ? find_held_lock+0x35/0xa0
[ 2.896238] ? v4l_enable_media_source+0x57/0x70
[ 2.896245] start_streaming+0x186/0x3a0
[ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
[ 2.896257] vb2_start_streaming+0x6d/0x130
[ 2.896262] ? vb2_start_streaming+0x6d/0x130
[ 2.896267] vb2_core_streamon+0x108/0x140
[ 2.896273] vb2_streamon+0x29/0x50
[ 2.896278] vb2_ioctl_streamon+0x42/0x50
[ 2.896284] v4l_streamon+0x20/0x30
[ 2.896288] __video_do_ioctl+0x1af/0x3c0
[ 2.896296] ? __might_fault+0x85/0x90
[ 2.896302] video_usercopy+0x27e/0x7e0
[ 2.896307] ? copy_overflow+0x20/0x20
[ 2.896313] ? find_held_lock+0x35/0xa0
[ 2.896319] ? __might_fault+0x3e/0x90
[ 2.896325] video_ioctl2+0x15/0x20
[ 2.896330] v4l2_ioctl+0x49/0x50
[ 2.896335] do_video_ioctl+0x93c/0x2360
[ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
[ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
[ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
[ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
[ 2.896364] do_fast_syscall_32+0x9a/0x2d6
[ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
[ 2.896377] RIP: 0023:0xf7e79b79
[ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
[ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
[ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
[ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
[ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
[ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

------------------------------------------------------------------------

> [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
>
> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Thursday, December 13, 2018 12:40 PM
> To: Zhang, Jun <[email protected]>
> Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Xiao, Jin
> <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie
> A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > Ok, we will test it, thanks!
>
> But please also try the sysrq-y with the earlier patch after a hang!
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney [mailto:[email protected]]
> > Sent: Thursday, December 13, 2018 10:43
> > To: Zhang, Jun <[email protected]>
> > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai,
> > Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > Hello, Paul
> > >
> > > I think the next patch is better.
> > > Because ULONG_CMP_GE could cause a double write, which risks writing back an old value.
> > > Please help review.
> > > I have not tested it. If you agree, we will test it.
> >
> > Just to make sure that I understand, you are worried about something like the following, correct?
> >
> > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > and finds them equal.
> >
> > o At just this time something like rcu_start_this_gp() assigns a new
> > (larger) value to rdp->gp_seq_needed.
> >
> > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > old value.
> >
> > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> >
> > Of course, there could be bugs. For example:
> >
> > o __note_gp_changes() might be called on a different CPU than that
> > corresponding to rdp. You can check this with something like:
> >
> > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> >
> > o The same things could happen with rcu_start_this_gp(), and the
> > above WARN_ON_ONCE() would work there as well.
> >
> > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > you are doing CPU-hotplug operations. (It can run on a CPU other
> > than rdp->cpu, but only at times when rdp->cpu is offline.)
> >
> > o Interrupts might not really be disabled.
> >
> > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> >
> > So why not just test it anyway? If it makes the bug go away, I will
> > be surprised, but it would not be the first surprise for me. ;-)
> >
> > Thanx, Paul
> >
> > > Thanks!
> > >
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1..c00f34e 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > zero_cpu_stall_ticks(rdp);
> > > }
> > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > WRITE_ONCE(rdp->gpwrap, false);
> > > rcu_gpnum_ovf(rnp, rdp);
> > >
> > >
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:[email protected]]
> > > Sent: Thursday, December 13, 2018 08:12
> > > To: He, Bo <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Zhang, Jun
> > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang,
> > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > Sun, Yi J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel. I also checked the latest kernel and the latest tag v4.20-rc6, and do not see sysrq_rcu there either.
> > > > Please correct me if I have something wrong.
> > >
> > > That would be because I sent you the wrong patch, apologies! :-/
> > >
> > > Please instead see the one below, which does add sysrq_rcu.
> > >
> > > Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang,
> > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > we reproduced on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.
> > > > >
> > > > > Strange.
> > > > >
> > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > >
> > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Wed Dec 12 16:10:09 2018 -0800
> > >
> > > rcu: Add sysrq rcu_node-dump capability
> > >
> > > Backported from v4.21/v5.0
> > >
> > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > for failing to start a grace period. This commit therefore adds a
> > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > must be set for this sysrq to be available.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1369f7..e9392a9d6291 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -61,6 +61,7 @@
> > > #include <linux/trace_events.h>
> > > #include <linux/suspend.h>
> > > #include <linux/ftrace.h>
> > > +#include <linux/sysrq.h>
> > >
> > > #include "tree.h"
> > > #include "rcu.h"
> > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > /* panic() on RCU Stall sysctl. */
> > > int sysctl_panic_on_rcu_stall __read_mostly;
> > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > +static bool sysrq_rcu;
> > > +module_param(sysrq_rcu, bool, 0444);
> > >
> > > /*
> > > * The rcu_scheduler_active variable is initialized to the value
> > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > }
> > > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > >
> > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > +static void sysrq_show_rcu(int key)
> > > +{
> > > +	show_rcu_gp_kthreads();
> > > +}
> > > +
> > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > +	.handler = sysrq_show_rcu,
> > > +	.help_msg = "show-rcu(y)",
> > > +	.action_msg = "Show RCU tree",
> > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > +};
> > > +
> > > +static int __init rcu_sysrq_init(void)
> > > +{
> > > +	if (sysrq_rcu)
> > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > +	return 0;
> > > +}
> > > +early_initcall(rcu_sysrq_init);
> > > +
> > > /*
> > > * Send along grace-period-related data for rcutorture diagnostics.
> > > */
> > >
> >
>



2018-12-14 02:17:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked, and there is no FAST_NO_HZ in .config:

Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set,
which means several code paths (for example, the rcu_prepare_for_idle()
laziness logic) can be ignored. Also CONFIG_HZ=1000, so the delta of
roughly 300,000 jiffies above works out to about a 300-second delay.

Thanx, Paul

> Here is the grep from .config:
> egrep "HZ|RCU" .config
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> # CONFIG_MACHZ_WDT is not set
> # RCU Debugging
> CONFIG_PROVE_RCU=y
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=7
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y
>
> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Friday, December 14, 2018 2:12 AM
> To: He, Bo <[email protected]>
> Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > one of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
> >
> > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
>
> This is quite helpful, thank you!
>
> The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
>
> The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
>
> But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
>
> 1. Because rcu_check_gp_start_stall() is only ever invoked from
> RCU_SOFTIRQ, it is possible that softirqs are stalled for
> whatever reason.
>
> 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> interrupt handler, it is possible that the scheduler tick has
> somehow been disabled. Traces from earlier runs showed a great
> deal of RCU callbacks queued, which would have caused RCU to
> refuse to allow the scheduler tick to be disabled, even if the
> corresponding CPU was idle.
>
> 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> that you are building for a battery-powered device) and all of the
> CPU's callbacks are lazy. Except that your earlier traces showed
> lots of non-lazy callbacks. Besides, even if all callbacks were
> lazy, there would still be a scheduling-clock interrupt every
> six seconds, and there are quite a few six-second intervals
> in a two-minute watchdog timeout.
>
> But if we cannot find the problem quickly, I will likely ask
> you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> be thought of as bisecting the RCU code looking for the bug.
>
> The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
> One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to mark rcu_check_gp_start_stall() noinline, since an inlined static function is invisible to the function tracer. You might have better ways to collect this information.
>
> Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
>
> You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
>
> Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
>
> The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
>
> The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
>
> o rcu_accelerate_cbs_unlocked(), which does the following, thus
> properly awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_advance_cbs(), which returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its
> callers, which are called out below.
>
> o __note_gp_changes(), which also returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task. Therefore, this function
> never needs to awaken the rcu_preempt task, because the fact
> that this function is executing means that this task is already
> awake. (Also, as noted above, we can eliminate this code from
> consideration because this task is known to have been sleeping
> at the last time that the RCU_GP_FLAG_INIT bit was set.)
>
> o rcu_report_qs_rdp(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
>
> rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> /* ^^^ Released rnp->lock */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_prepare_for_idle(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> Now for rcu_advance_cbs():
>
> o __note_gp_changes(), which also returns the value returned
> by rcu_advance_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_migrate_callbacks(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> rcu_advance_cbs(rsp, rnp_root, my_rdp);
> rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> !rcu_segcblist_n_cbs(&my_rdp->cblist));
> raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> Now for __note_gp_changes():
>
> o note_gp_changes(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = __note_gp_changes(rsp, rnp, rdp);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_gp_init() which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task, which makes wakeups
> unnecessary, just as for rcu_gp_cleanup() above.
>
> o rcu_gp_cleanup(), ditto.
>
> So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
> lockdep complaint:
> ------------------------------------------------------------------------
>
> [ 2.895507] ======================================================
> [ 2.895511] WARNING: possible circular locking dependency detected
> [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> [ 2.895521] ------------------------------------------------------
> [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895546]
> [ 2.895546] but task is already holding lock:
> [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895561]
> [ 2.895561] which lock already depends on the new lock.
> [ 2.895561]
> [ 2.895566]
> [ 2.895566] the existing dependency chain (in reverse order) is:
> [ 2.895570]
> [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> [ 2.895583] __mutex_lock+0x80/0x9a0
> [ 2.895588] mutex_lock_nested+0x1b/0x20
> [ 2.895593] media_device_register_entity+0x92/0x1e0
> [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> [ 2.895608] isys_probe+0x6cb/0xed0
> [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> [ 2.895620] really_probe+0x268/0x3d0
> [ 2.895625] driver_probe_device+0x11a/0x130
> [ 2.895630] __device_attach_driver+0x86/0x100
> [ 2.895635] bus_for_each_drv+0x6e/0xb0
> [ 2.895640] __device_attach+0xdf/0x160
> [ 2.895645] device_initial_probe+0x13/0x20
> [ 2.895650] bus_probe_device+0xa6/0xc0
> [ 2.895655] deferred_probe_work_func+0x88/0xe0
> [ 2.895661] process_one_work+0x220/0x5c0
> [ 2.895665] worker_thread+0x1da/0x3b0
> [ 2.895670] kthread+0x12c/0x150
> [ 2.895675] ret_from_fork+0x3a/0x50
> [ 2.895678]
> [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> [ 2.895688] lock_acquire+0x95/0x1a0
> [ 2.895693] __mutex_lock+0x80/0x9a0
> [ 2.895698] mutex_lock_nested+0x1b/0x20
> [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> [ 2.895728] csi2_link_validate+0xc6/0x220
> [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> [ 2.895738] media_pipeline_start+0x33/0x50
> [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.895748] start_streaming+0x186/0x3a0
> [ 2.895753] vb2_start_streaming+0x6d/0x130
> [ 2.895758] vb2_core_streamon+0x108/0x140
> [ 2.895762] vb2_streamon+0x29/0x50
> [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> [ 2.895772] v4l_streamon+0x20/0x30
> [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> [ 2.895781] video_usercopy+0x27e/0x7e0
> [ 2.895785] video_ioctl2+0x15/0x20
> [ 2.895789] v4l2_ioctl+0x49/0x50
> [ 2.895794] do_video_ioctl+0x93c/0x2360
> [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.895821]
> [ 2.895821] other info that might help us debug this:
> [ 2.895821]
> [ 2.895826] Possible unsafe locking scenario:
> [ 2.895826]
> [ 2.895830]        CPU0                    CPU1
> [ 2.895833]        ----                    ----
> [ 2.895836]   lock(&mdev->graph_mutex);
> [ 2.895842]                                lock(&asd->mutex);
> [ 2.895847]                                lock(&mdev->graph_mutex);
> [ 2.895852]   lock(&asd->mutex);
> [ 2.895857]
> [ 2.895857] *** DEADLOCK ***
> [ 2.895857]
> [ 2.895863] 3 locks held by earlyEvs/1839:
> [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895896]
> [ 2.895896] stack backtrace:
> [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> [ 2.895907] Call Trace:
> [ 2.895915] dump_stack+0x70/0xa5
> [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> [ 2.895927] __lock_acquire+0x1284/0x1340
> [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> [ 2.895940] lock_acquire+0x95/0x1a0
> [ 2.895945] ? lock_acquire+0x95/0x1a0
> [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895961] __mutex_lock+0x80/0x9a0
> [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895971] ? crlmodule_get_format+0x43/0x50
> [ 2.895979] mutex_lock_nested+0x1b/0x20
> [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> [ 2.896018] ? find_held_lock+0x35/0xa0
> [ 2.896023] ? crlmodule_get_format+0x43/0x50
> [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896042] ? mutex_unlock+0x12/0x20
> [ 2.896046] ? crlmodule_get_format+0x43/0x50
> [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> [ 2.896070] ? kernel_text_address+0xc4/0x100
> [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> [ 2.896085] ? kernel_text_address+0xc4/0x100
> [ 2.896092] ? __lock_acquire+0x1106/0x1340
> [ 2.896096] ? __lock_acquire+0x1169/0x1340
> [ 2.896103] csi2_link_validate+0xc6/0x220
> [ 2.896110] ? __lock_is_held+0x5a/0xa0
> [ 2.896115] ? mark_held_locks+0x58/0x80
> [ 2.896122] ? __kmalloc+0x207/0x2e0
> [ 2.896127] ? __lock_is_held+0x5a/0xa0
> [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> [ 2.896144] ? media_pipeline_start+0x28/0x50
> [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> [ 2.896155] ? csi2_has_route+0x18/0x20
> [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896179] media_pipeline_start+0x33/0x50
> [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.896191] ? __lock_acquire+0x132e/0x1340
> [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> [ 2.896204] ? lock_acquire+0x95/0x1a0
> [ 2.896209] ? start_streaming+0x5c/0x3a0
> [ 2.896215] ? start_streaming+0x5c/0x3a0
> [ 2.896221] ? __mutex_lock+0x391/0x9a0
> [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> [ 2.896233] ? find_held_lock+0x35/0xa0
> [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> [ 2.896245] start_streaming+0x186/0x3a0
> [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896257] vb2_start_streaming+0x6d/0x130
> [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> [ 2.896267] vb2_core_streamon+0x108/0x140
> [ 2.896273] vb2_streamon+0x29/0x50
> [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> [ 2.896284] v4l_streamon+0x20/0x30
> [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> [ 2.896296] ? __might_fault+0x85/0x90
> [ 2.896302] video_usercopy+0x27e/0x7e0
> [ 2.896307] ? copy_overflow+0x20/0x20
> [ 2.896313] ? find_held_lock+0x35/0xa0
> [ 2.896319] ? __might_fault+0x3e/0x90
> [ 2.896325] video_ioctl2+0x15/0x20
> [ 2.896330] v4l2_ioctl+0x49/0x50
> [ 2.896335] do_video_ioctl+0x93c/0x2360
> [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.896377] RIP: 0023:0xf7e79b79
> [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> ------------------------------------------------------------------------
>
> > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> >
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Thursday, December 13, 2018 12:40 PM
> > To: Zhang, Jun <[email protected]>
> > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie
> > A <[email protected]>; Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > Ok, we will test it, thanks!
> >
> > But please also try the sysrq-y with the earlier patch after a hang!
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:[email protected]]
> > > Sent: Thursday, December 13, 2018 10:43
> > > To: Zhang, Jun <[email protected]>
> > > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Xiao, Jin
> > > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai,
> > > Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > Hello, Paul
> > > >
> > > > I think the next patch is better.
> > > > Because ULONG_CMP_GE could cause a double write, which risks writing back an old value.
> > > > Please help review.
> > > > I have not tested it. If you agree, we will test it.
> > >
> > > Just to make sure that I understand, you are worried about something like the following, correct?
> > >
> > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > and finds them equal.
> > >
> > > o At just this time something like rcu_start_this_gp() assigns a new
> > > (larger) value to rdp->gp_seq_needed.
> > >
> > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > old value.
> > >
> > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > >
> > > Of course, there could be bugs. For example:
> > >
> > > o __note_gp_changes() might be called on a different CPU than that
> > > corresponding to rdp. You can check this with something like:
> > >
> > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > >
> > > o The same things could happen with rcu_start_this_gp(), and the
> > > above WARN_ON_ONCE() would work there as well.
> > >
> > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > >
> > > o Interrupts might not really be disabled.
> > >
> > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > >
> > > So why not just test it anyway? If it makes the bug go away, I will
> > > be surprised, but it would not be the first surprise for me. ;-)
> > >
> > > Thanx, Paul
> > >
> > > > Thanks!
> > > >
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 0b760c1..c00f34e 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > zero_cpu_stall_ticks(rdp);
> > > > }
> > > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > > rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > WRITE_ONCE(rdp->gpwrap, false);
> > > > rcu_gpnum_ovf(rnp, rdp);
> > > >
> > > >
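For readers following the ULONG_CMP_GE()/ULONG_CMP_LT() distinction above: a minimal standalone sketch, reusing the wrap-safe comparison macros from include/linux/rcupdate.h (copied here on the assumption that they match the v4.19 definitions), shows that the two tests differ only when the two sequence numbers are equal, which is exactly the case where the proposed patch skips the redundant store:

#include <limits.h>
#include <stdio.h>

/* Wrap-safe sequence comparisons, as in include/linux/rcupdate.h. */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

int main(void)
{
	unsigned long rnp_need = 23716092, rdp_need = 23716092;

	/* Equal values: the old test stores, the new test does not. */
	printf("equal: GE(rnp, rdp)=%d LT(rdp, rnp)=%d\n",
	       ULONG_CMP_GE(rnp_need, rdp_need),
	       ULONG_CMP_LT(rdp_need, rnp_need));

	/* rnp ahead of rdp, even across wraparound: both tests store. */
	rnp_need = 2;
	rdp_need = ULONG_MAX - 1;
	printf("ahead: GE(rnp, rdp)=%d LT(rdp, rnp)=%d\n",
	       ULONG_CMP_GE(rnp_need, rdp_need),
	       ULONG_CMP_LT(rdp_need, rnp_need));
	return 0;
}

With these definitions, ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) is true for equal values and so rewrites rdp->gp_seq_needed even when nothing changed, while ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) is false and leaves it alone.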
> > > > -----Original Message-----
> > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > Sent: Thursday, December 13, 2018 08:12
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang, Jun
> > > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang,
> > > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > > Sun, Yi J <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6 and do not see sysrq_rcu.
> > > > > Please correct me if I am missing something.
> > > >
> > > > That would be because I sent you the wrong patch, apologies! :-/
> > > >
> > > > Please instead see the one below, which does add sysrq_rcu.
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <[email protected]>
> > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > To: He, Bo <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Zhang,
> > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > we reproduced it on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as it is extracted from /proc/config.gz.
> > > > > >
> > > > > > Strange.
> > > > > >
> > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > >
> > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > Author: Paul E. McKenney <[email protected]>
> > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > >
> > > > rcu: Add sysrq rcu_node-dump capability
> > > >
> > > > Backported from v4.21/v5.0
> > > >
> > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > for failing to start a grace period. This commit therefore adds a
> > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > must be set for this sysrq to be available.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -61,6 +61,7 @@
> > > > #include <linux/trace_events.h>
> > > > #include <linux/suspend.h>
> > > > #include <linux/ftrace.h>
> > > > +#include <linux/sysrq.h>
> > > >
> > > > #include "tree.h"
> > > > #include "rcu.h"
> > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > /* panic() on RCU Stall sysctl. */
> > > > int sysctl_panic_on_rcu_stall __read_mostly;
> > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > +static bool sysrq_rcu;
> > > > +module_param(sysrq_rcu, bool, 0444);
> > > >
> > > > /*
> > > > * The rcu_scheduler_active variable is initialized to the value
> > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > > }
> > > > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > >
> > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > +static void sysrq_show_rcu(int key)
> > > > +{
> > > > +	show_rcu_gp_kthreads();
> > > > +}
> > > > +
> > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > +	.handler = sysrq_show_rcu,
> > > > +	.help_msg = "show-rcu(y)",
> > > > +	.action_msg = "Show RCU tree",
> > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > +};
> > > > +
> > > > +static int __init rcu_sysrq_init(void)
> > > > +{
> > > > +	if (sysrq_rcu)
> > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > +	return 0;
> > > > +}
> > > > +early_initcall(rcu_sysrq_init);
> > > > +
> > > > /*
> > > > * Send along grace-period-related data for rcutorture diagnostics.
> > > > */
> > > >
> > >
> >
>
>


2018-12-14 02:42:02

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

We have run another experiment with the enclosed debug patch, with more rcu trace events enabled but without the CONFIG_RCU_BOOST config. We have not reproduced the issue after 90 hours so far on 10 boards (per previous experience, the issue should reproduce within one night).

The purpose is to capture more rcu event traces close to the point where the issue happens: I checked that __wait_rcu_gp is not always running, so we think that even though the panic triggers on a 3s timeout, the issue has already happened before those 3s.

And actually rsp->gp_flags = 1, but the wait state is RCU_GP_WAIT_GPS(1) with ->state: 0x402, which means the kthread has not been scheduled for 300s even though RCU_GP_FLAG_INIT is set. What are your ideas?
---------------------------------------------------------------------------------------------------------------------------------
-	swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-				   RCU_GP_FLAG_INIT);
+	if (current->pid != rcu_preempt_pid) {
+		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+					   RCU_GP_FLAG_INIT);
+	} else {
+		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq,
+				READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT, 2*HZ);
+
+		if (!ret) {
+			show_rcu_gp_kthreads();
+			panic("hung_task: blocked in rcu_gp_kthread init");
+		}
+	}
--------------------------------------------------------------------------------------
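On the "->state: 0x402" in the question above: a minimal sketch, assuming the task-state bit values from v4.19's include/linux/sched.h, showing that 0x402 decodes to TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD), i.e. the uninterruptible sleep that does not contribute to the load average, which is the state swait_event_idle_exclusive() parks the kthread in:

#include <stdio.h>

/* Task-state bits as in v4.19 include/linux/sched.h (assumed values). */
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_NOLOAD		0x0400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

int main(void)
{
	/* Prints 0x402, matching the "->state: 0x402" in the dump. */
	printf("TASK_IDLE = 0x%x\n", TASK_IDLE);
	return 0;
}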
-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Friday, December 14, 2018 10:15 AM
To: He, Bo <[email protected]>
Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
Subject: Re: rcu_preempt caused oom

On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double checked there is no FAST_NO_HZ in .config:

Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set, which means several code paths can be ignored. Also CONFIG_HZ=1000, so the roughly 300,000-jiffy delta corresponds to a 300-second delay.

Thanx, Paul

> Here is the grep from .config:
> egrep "HZ|RCU" .config
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> # CONFIG_MACHZ_WDT is not set
> # RCU Debugging
> CONFIG_PROVE_RCU=y
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=7
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y
>
> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Friday, December 14, 2018 2:12 AM
> To: He, Bo <[email protected]>
> Cc: Zhang, Jun <[email protected]>; Steven Rostedt
> <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin
> <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J
> <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > one of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
> >
> > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
>
> This is quite helpful, thank you!
>
> The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
>
> The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
>
> But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
>
> 1. Because rcu_check_gp_start_stall() is only ever invoked from
> RCU_SOFTIRQ, it is possible that softirqs are stalled for
> whatever reason.
>
> 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> interrupt handler, it is possible that the scheduler tick has
> somehow been disabled. Traces from earlier runs showed a great
> deal of RCU callbacks queued, which would have caused RCU to
> refuse to allow the scheduler tick to be disabled, even if the
> corresponding CPU was idle.
>
> 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> that you are building for a battery-powered device) and all of the
> CPU's callbacks are lazy. Except that your earlier traces showed
> lots of non-lazy callbacks. Besides, even if all callbacks were
> lazy, there would still be a scheduling-clock interrupt every
> six seconds, and there are quite a few six-second intervals
> in a two-minute watchdog timeout.
>
> But if we cannot find the problem quickly, I will likely ask
> you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> be thought of as bisecting the RCU code looking for the bug.
>
> The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
> One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to mark rcu_check_gp_start_stall() as noinline. You might have better ways to collect this information.
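As an aside, a hedged sketch of setting up that tracing (assuming the usual tracefs mount point at /sys/kernel/debug/tracing; the same writes can be done with echo from a shell):

#include <stdio.h>

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	const char *dir = "/sys/kernel/debug/tracing";
	char path[256];

	/* Trace only the three functions of interest. */
	snprintf(path, sizeof(path), "%s/set_ftrace_filter", dir);
	write_file(path, "rcu_check_callbacks\n__rcu_process_callbacks\nrcu_check_gp_start_stall\n");
	snprintf(path, sizeof(path), "%s/current_tracer", dir);
	write_file(path, "function");
	snprintf(path, sizeof(path), "%s/tracing_on", dir);
	write_file(path, "1");
	return 0;
}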
>
> Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
>
> You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
>
> Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
>
> The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
>
> The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
>
> o rcu_accelerate_cbs_unlocked(), which does the following, thus
> properly awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_advance_cbs(), which returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its
> callers, which are called out below.
>
> o __note_gp_changes(), which also returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task. Therefore, this function
> never needs to awaken the rcu_preempt task, because the fact
> that this function is executing means that this task is already
> awake. (Also, as noted above, we can eliminate this code from
> consideration because this task is known to have been sleeping
> at the last time that the RCU_GP_FLAG_INIT bit was set.)
>
> o rcu_report_qs_rdp(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
>
> rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> /* ^^^ Released rnp->lock */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_prepare_for_idle(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> Now for rcu_advance_cbs():
>
> o __note_gp_changes(), which also returns the value returned
> by rcu_advance_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_migrate_callbacks(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> rcu_advance_cbs(rsp, rnp_root, my_rdp);
> rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> !rcu_segcblist_n_cbs(&my_rdp->cblist));
> raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> Now for __note_gp_changes():
>
> o note_gp_changes(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> needwake = __note_gp_changes(rsp, rnp, rdp);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (needwake)
> rcu_gp_kthread_wake(rsp);
>
> o rcu_gp_init() which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task, which makes wakeups
> unnecessary, just as for rcu_gp_cleanup() above.
>
> o rcu_gp_cleanup(), ditto.
>
> So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
> lockdep complaint:
> ------------------------------------------------------------------------
>
> [ 2.895507] ======================================================
> [ 2.895511] WARNING: possible circular locking dependency detected
> [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> [ 2.895521] ------------------------------------------------------
> [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895546]
> [ 2.895546] but task is already holding lock:
> [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895561]
> [ 2.895561] which lock already depends on the new lock.
> [ 2.895561]
> [ 2.895566]
> [ 2.895566] the existing dependency chain (in reverse order) is:
> [ 2.895570]
> [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> [ 2.895583] __mutex_lock+0x80/0x9a0
> [ 2.895588] mutex_lock_nested+0x1b/0x20
> [ 2.895593] media_device_register_entity+0x92/0x1e0
> [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> [ 2.895608] isys_probe+0x6cb/0xed0
> [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> [ 2.895620] really_probe+0x268/0x3d0
> [ 2.895625] driver_probe_device+0x11a/0x130
> [ 2.895630] __device_attach_driver+0x86/0x100
> [ 2.895635] bus_for_each_drv+0x6e/0xb0
> [ 2.895640] __device_attach+0xdf/0x160
> [ 2.895645] device_initial_probe+0x13/0x20
> [ 2.895650] bus_probe_device+0xa6/0xc0
> [ 2.895655] deferred_probe_work_func+0x88/0xe0
> [ 2.895661] process_one_work+0x220/0x5c0
> [ 2.895665] worker_thread+0x1da/0x3b0
> [ 2.895670] kthread+0x12c/0x150
> [ 2.895675] ret_from_fork+0x3a/0x50
> [ 2.895678]
> [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> [ 2.895688] lock_acquire+0x95/0x1a0
> [ 2.895693] __mutex_lock+0x80/0x9a0
> [ 2.895698] mutex_lock_nested+0x1b/0x20
> [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> [ 2.895728] csi2_link_validate+0xc6/0x220
> [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> [ 2.895738] media_pipeline_start+0x33/0x50
> [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.895748] start_streaming+0x186/0x3a0
> [ 2.895753] vb2_start_streaming+0x6d/0x130
> [ 2.895758] vb2_core_streamon+0x108/0x140
> [ 2.895762] vb2_streamon+0x29/0x50
> [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> [ 2.895772] v4l_streamon+0x20/0x30
> [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> [ 2.895781] video_usercopy+0x27e/0x7e0
> [ 2.895785] video_ioctl2+0x15/0x20
> [ 2.895789] v4l2_ioctl+0x49/0x50
> [ 2.895794] do_video_ioctl+0x93c/0x2360
> [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.895821]
> [ 2.895821] other info that might help us debug this:
> [ 2.895821]
> [ 2.895826] Possible unsafe locking scenario:
> [ 2.895826]
> [ 2.895830] CPU0 CPU1
> [ 2.895833] ---- ----
> [ 2.895836] lock(&mdev->graph_mutex);
> [ 2.895842] lock(&asd->mutex);
> [ 2.895847] lock(&mdev->graph_mutex);
> [ 2.895852] lock(&asd->mutex);
> [ 2.895857]
> [ 2.895857] *** DEADLOCK ***
> [ 2.895857]
> [ 2.895863] 3 locks held by earlyEvs/1839:
> [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895896]
> [ 2.895896] stack backtrace:
> [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> [ 2.895907] Call Trace:
> [ 2.895915] dump_stack+0x70/0xa5
> [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> [ 2.895927] __lock_acquire+0x1284/0x1340
> [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> [ 2.895940] lock_acquire+0x95/0x1a0
> [ 2.895945] ? lock_acquire+0x95/0x1a0
> [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895961] __mutex_lock+0x80/0x9a0
> [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895971] ? crlmodule_get_format+0x43/0x50
> [ 2.895979] mutex_lock_nested+0x1b/0x20
> [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> [ 2.896018] ? find_held_lock+0x35/0xa0
> [ 2.896023] ? crlmodule_get_format+0x43/0x50
> [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896042] ? mutex_unlock+0x12/0x20
> [ 2.896046] ? crlmodule_get_format+0x43/0x50
> [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> [ 2.896070] ? kernel_text_address+0xc4/0x100
> [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> [ 2.896085] ? kernel_text_address+0xc4/0x100
> [ 2.896092] ? __lock_acquire+0x1106/0x1340
> [ 2.896096] ? __lock_acquire+0x1169/0x1340
> [ 2.896103] csi2_link_validate+0xc6/0x220
> [ 2.896110] ? __lock_is_held+0x5a/0xa0
> [ 2.896115] ? mark_held_locks+0x58/0x80
> [ 2.896122] ? __kmalloc+0x207/0x2e0
> [ 2.896127] ? __lock_is_held+0x5a/0xa0
> [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> [ 2.896144] ? media_pipeline_start+0x28/0x50
> [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> [ 2.896155] ? csi2_has_route+0x18/0x20
> [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896179] media_pipeline_start+0x33/0x50
> [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.896191] ? __lock_acquire+0x132e/0x1340
> [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> [ 2.896204] ? lock_acquire+0x95/0x1a0
> [ 2.896209] ? start_streaming+0x5c/0x3a0
> [ 2.896215] ? start_streaming+0x5c/0x3a0
> [ 2.896221] ? __mutex_lock+0x391/0x9a0
> [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> [ 2.896233] ? find_held_lock+0x35/0xa0
> [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> [ 2.896245] start_streaming+0x186/0x3a0
> [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896257] vb2_start_streaming+0x6d/0x130
> [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> [ 2.896267] vb2_core_streamon+0x108/0x140
> [ 2.896273] vb2_streamon+0x29/0x50
> [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> [ 2.896284] v4l_streamon+0x20/0x30
> [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> [ 2.896296] ? __might_fault+0x85/0x90
> [ 2.896302] video_usercopy+0x27e/0x7e0
> [ 2.896307] ? copy_overflow+0x20/0x20
> [ 2.896313] ? find_held_lock+0x35/0xa0
> [ 2.896319] ? __might_fault+0x3e/0x90
> [ 2.896325] video_ioctl2+0x15/0x20
> [ 2.896330] v4l2_ioctl+0x49/0x50
> [ 2.896335] do_video_ioctl+0x93c/0x2360
> [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.896377] RIP: 0023:0xf7e79b79
> [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> ----------------------------------------------------------------------
> --
>
> > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 - >gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 - >gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> >
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Thursday, December 13, 2018 12:40 PM
> > To: Zhang, Jun <[email protected]>
> > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai,
> > Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > Ok, we will test it, thanks!
> >
> > But please also try the sysrq-y with the earlier patch after a hang!
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:[email protected]]
> > > Sent: Thursday, December 13, 2018 10:43
> > > To: Zhang, Jun <[email protected]>
> > > Cc: He, Bo <[email protected]>; Steven Rostedt
> > > <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > Sun, Yi J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > Hello, Paul
> > > >
> > > > I think the next patch is better.
> > > > > Because ULONG_CMP_GE could cause a double write, which carries the risk of writing back an old value.
> > > > > Please help review.
> > > > > I have not tested it. If you agree, we will test it.
> > >
> > > Just to make sure that I understand, you are worried about something like the following, correct?
> > >
> > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > and finds them equal.
> > >
> > > o At just this time something like rcu_start_this_gp() assigns a new
> > > (larger) value to rdp->gp_seq_needed.
> > >
> > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > old value.
> > >
> > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > >
> > > Of course, there could be bugs. For example:
> > >
> > > o __note_gp_changes() might be called on a different CPU than that
> > > corresponding to rdp. You can check this with something like:
> > >
> > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > >
> > > o The same things could happen with rcu_start_this_gp(), and the
> > > above WARN_ON_ONCE() would work there as well.
> > >
> > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > >
> > > o Interrupts might not really be disabled.
> > >
> > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > >
> > > So why not just test it anyway? If it makes the bug go away, I
> > > will be surprised, but it would not be the first surprise for me.
> > > ;-)
> > >
> > > Thanx, Paul
> > >
> > > > Thanks!
> > > >
> > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 0b760c1..c00f34e 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > zero_cpu_stall_ticks(rdp);
> > > > }
> > > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > > rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > WRITE_ONCE(rdp->gpwrap, false);
> > > > rcu_gpnum_ovf(rnp, rdp);
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > Sent: Thursday, December 13, 2018 08:12
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang,
> > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > <[email protected]>; Sun, Yi J <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6 and do not see sysrq_rcu.
> > > > > > Please correct me if I am missing something.
> > > >
> > > > That would be because I sent you the wrong patch, apologies!
> > > > :-/
> > > >
> > > > Please instead see the one below, which does add sysrq_rcu.
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <[email protected]>
> > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > To: He, Bo <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Zhang,
> > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > > we reproduced it on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as it is extracted from /proc/config.gz.
> > > > > >
> > > > > > Strange.
> > > > > >
> > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > >
> > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > >
> > > > ----------------------------------------------------------------
> > > > --
> > > > --
> > > > --
> > > > --
> > > >
> > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > Author: Paul E. McKenney <[email protected]>
> > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > >
> > > > rcu: Add sysrq rcu_node-dump capability
> > > >
> > > > Backported from v4.21/v5.0
> > > >
> > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > for failing to start a grace period. This commit therefore adds a
> > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > must be set for this sysrq to be available.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -61,6 +61,7 @@
> > > > > #include <linux/trace_events.h>
> > > > > #include <linux/suspend.h>
> > > > #include <linux/ftrace.h>
> > > > +#include <linux/sysrq.h>
> > > >
> > > > #include "tree.h"
> > > > #include "rcu.h"
> > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT; int
> > > > rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total #
> > > > rcu_nodes in use. */
> > > > /* panic() on RCU Stall sysctl. */ int
> > > > sysctl_panic_on_rcu_stall __read_mostly;
> > > > +/* Commandeer a sysrq key to dump RCU's tree. */ static bool
> > > > +sysrq_rcu; module_param(sysrq_rcu, bool, 0444);
> > > >
> > > > /*
> > > > * The rcu_scheduler_active variable is initialized to the
> > > > value @@
> > > > -662,6 +666,27 @@ void show_rcu_gp_kthreads(void) }
> > > > EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > >
> > > > +/* Dump grace-period-request information due to commandeered sysrq.
> > > > +*/ static void sysrq_show_rcu(int key) {
> > > > + show_rcu_gp_kthreads();
> > > > +}
> > > > +
> > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > + .handler = sysrq_show_rcu,
> > > > + .help_msg = "show-rcu(y)",
> > > > + .action_msg = "Show RCU tree",
> > > > + .enable_mask = SYSRQ_ENABLE_DUMP, };
> > > > +
> > > > +static int __init rcu_sysrq_init(void) {
> > > > + if (sysrq_rcu)
> > > > + return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > + return 0;
> > > > +}
> > > > +early_initcall(rcu_sysrq_init);
> > > > +
> > > > /*
> > > > * Send along grace-period-related data for rcutorture diagnostics.
> > > > */
> > > >
> > >
> >
>
>


Attachments:
0001-rcu-detect-the-preempt_rcu-hang.patch (3.65 kB)

2018-12-14 05:11:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Fri, Dec 14, 2018 at 02:40:50AM +0000, He, Bo wrote:
> We have run another experiment with the enclosed debug patch, with more rcu trace events enabled but without the CONFIG_RCU_BOOST config. We have not reproduced the issue after 90 hours so far on 10 boards (per previous experience, the issue should reproduce within one night).

That certainly supports the hypothesis that a wakeup is either not
being sent or is being lost. Your patch is great for debugging (thank
you!), but the real solution of course needs to avoid the extra wakeups,
especially on battery-powered systems.
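As a side illustration, the suspected failure mode and the timeout-based detection reduce to a condition variable whose signal is skipped. The following is a self-contained userspace analogue only, not kernel code; the requester/waiter split and the gp_flags variable are invented for the sketch:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static int gp_flags;	/* stands in for rsp->gp_flags & RCU_GP_FLAG_INIT */

/* Sets the flag but "forgets" the pthread_cond_signal(&wake). */
static void *requester(void *arg)
{
	(void)arg;
	sleep(1);			/* let the waiter block first */
	pthread_mutex_lock(&lock);
	gp_flags = 1;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct timespec deadline;

	pthread_create(&t, NULL, requester, NULL);
	pthread_mutex_lock(&lock);
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 2;		/* mirrors the 2*HZ timeout */
	while (!gp_flags) {
		if (pthread_cond_timedwait(&wake, &lock, &deadline) == ETIMEDOUT)
			break;
	}
	if (gp_flags)
		printf("request pending but no wakeup delivered: lost-wakeup suspect\n");
	else
		printf("timed out with nothing pending: legitimately idle\n");
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return 0;
}

Run as written, the waiter blocks, the flag is set without a signal, and the two-second timeout is what surfaces the problem, just as the swait timeout does in the debug patch.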

One suggested change below, to get rid of potential false positives.

> The purpose is to capture more rcu event traces close to the point where the issue happens: I checked that __wait_rcu_gp is not always running, so we think that even though the panic triggers on a 3s timeout, the issue has already happened before those 3s.

Agreed, it would be really good to have trace information from the cause.
In the case you sent yesterday, it would be good to have trace information
from 308.256 seconds prior to the sysrq-v, for example, by collecting the
same event traces you did a few days ago. It would also be good to know
whether the scheduler tick is providing interrupts, and if so, why
rcu_check_gp_start_stall() isn't being invoked. ;-)

If collecting this information with your setup is not feasible (for
example, you might need a large trace buffer to capture five minutes
of traces), please let me know and I can provide additional debug
code. Or you could add "rcu_ftrace_dump(DUMP_ALL);" just before the
"show_rcu_gp_kthreads();" in your patch below.

> And actually rsp->gp_flags = 1, but the wait state is RCU_GP_WAIT_GPS(1) with ->state: 0x402, which means the kthread has not been scheduled for 300s even though RCU_GP_FLAG_INIT is set. What are your ideas?

The most likely possibility is that my analysis below is confused and
there really is some way that the code can set the RCU_GP_FLAG_INIT
bit without later doing a wakeup. The trace data above could help
unconfuse me.

Thanx, Paul

> ---------------------------------------------------------------------------------------------------------------------------------
> -	swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> -				   RCU_GP_FLAG_INIT);
> +	if (current->pid != rcu_preempt_pid) {
> +		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> +					   RCU_GP_FLAG_INIT);
> +	} else {

wait_again:

> +		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq,
> +				READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT, 2*HZ);
> +
> +		if (!ret) {

This would avoid complaining if RCU was legitimately idle for a long time:

	if (!ret && !READ_ONCE(rsp->gp_flags)) {
		rcu_ftrace_dump(DUMP_ALL);
		show_rcu_gp_kthreads();
		panic("hung_task: blocked in rcu_gp_kthread init");
	} else if (!ret) {
		goto wait_again;
	}

> +			show_rcu_gp_kthreads();
> +			panic("hung_task: blocked in rcu_gp_kthread init");
> +		}
> +	}
> --------------------------------------------------------------------------------------
> -----Original Message-----
> From: Paul E. McKenney <[email protected]>
> Sent: Friday, December 14, 2018 10:15 AM
> To: He, Bo <[email protected]>
> Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> Subject: Re: rcu_preempt caused oom
>
> On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> > as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double checked there is no FAST_NO_HZ in .config:
>
> Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set, which means several code paths can be ignored. Also CONFIG_HZ=1000, so the roughly 300,000-jiffy delta corresponds to a 300-second delay.
>
> Thanx, Paul
>
> > Here is the grep from .config:
> > egrep "HZ|RCU" .config
> > CONFIG_NO_HZ_COMMON=y
> > # CONFIG_HZ_PERIODIC is not set
> > CONFIG_NO_HZ_IDLE=y
> > # CONFIG_NO_HZ_FULL is not set
> > CONFIG_NO_HZ=y
> > # RCU Subsystem
> > CONFIG_PREEMPT_RCU=y
> > # CONFIG_RCU_EXPERT is not set
> > CONFIG_SRCU=y
> > CONFIG_TREE_SRCU=y
> > CONFIG_TASKS_RCU=y
> > CONFIG_RCU_STALL_COMMON=y
> > CONFIG_RCU_NEED_SEGCBLIST=y
> > # CONFIG_HZ_100 is not set
> > # CONFIG_HZ_250 is not set
> > # CONFIG_HZ_300 is not set
> > CONFIG_HZ_1000=y
> > CONFIG_HZ=1000
> > # CONFIG_MACHZ_WDT is not set
> > # RCU Debugging
> > CONFIG_PROVE_RCU=y
> > CONFIG_RCU_PERF_TEST=m
> > CONFIG_RCU_TORTURE_TEST=m
> > CONFIG_RCU_CPU_STALL_TIMEOUT=7
> > CONFIG_RCU_TRACE=y
> > CONFIG_RCU_EQS_DEBUG=y
> >
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Friday, December 14, 2018 2:12 AM
> > To: He, Bo <[email protected]>
> > Cc: Zhang, Jun <[email protected]>; Steven Rostedt
> > <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J
> > <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > > one of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
> > >
> > > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
> >
> > This is quite helpful, thank you!
> >
> > The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> > The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
> >
> > The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> > The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
> >
> > But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
> >
> > 1. Because rcu_check_gp_start_stall() is only ever invoked from
> > RCU_SOFTIRQ, it is possible that softirqs are stalled for
> > whatever reason.
> >
> > 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> > interrupt handler, it is possible that the scheduler tick has
> > somehow been disabled. Traces from earlier runs showed a great
> > deal of RCU callbacks queued, which would have caused RCU to
> > refuse to allow the scheduler tick to be disabled, even if the
> > corresponding CPU was idle.
> >
> > 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> > that you are building for a battery-powered device) and all of the
> > CPU's callbacks are lazy. Except that your earlier traces showed
> > lots of non-lazy callbacks. Besides, even if all callbacks were
> > lazy, there would still be a scheduling-clock interrupt every
> > six seconds, and there are quite a few six-second intervals
> > in a two-minute watchdog timeout.
> >
> > But if we cannot find the problem quickly, I will likely ask
> > you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> > be thought of as bisecting the RCU code looking for the bug.
> >
> > The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
> > One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to mark rcu_check_gp_start_stall() as noinline. You might have better ways to collect this information.
> >
> > Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
> >
> > You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
> >
> > Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> > I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
> >
> > The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> > Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
> >
> > The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
> >
> > o rcu_accelerate_cbs_unlocked(), which does the following, thus
> > properly awakening the rcu_preempt task when needed:
> >
> > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > if (needwake)
> > rcu_gp_kthread_wake(rsp);
> >
> > o rcu_advance_cbs(), which returns the value returned by
> > rcu_accelerate_cbs(), thus pushing the problem off to its
> > callers, which are called out below.
> >
> > o __note_gp_changes(), which also returns the value returned by
> > rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> > which are called out below.
> >
> > o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> > kthreads such as the rcu_preempt task. Therefore, this function
> > never needs to awaken the rcu_preempt task, because the fact
> > that this function is executing means that this task is already
> > awake. (Also, as noted above, we can eliminate this code from
> > consideration because this task is known to have been sleeping
> > at the last time that the RCU_GP_FLAG_INIT bit was set.)
> >
> > o rcu_report_qs_rdp(), which does the following, thus properly
> > awakening the rcu_preempt task when needed:
> >
> > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> >
> > rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> > /* ^^^ Released rnp->lock */
> > if (needwake)
> > rcu_gp_kthread_wake(rsp);
> >
> > o rcu_prepare_for_idle(), which does the following, thus properly
> > awakening the rcu_preempt task when needed:
> >
> > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > if (needwake)
> > rcu_gp_kthread_wake(rsp);
> >
> > Now for rcu_advance_cbs():
> >
> > o __note_gp_changes(), which also returns the value returned
> > by rcu_advance_cbs(), thus pushing the problem off to its callers,
> > which are called out below.
> >
> > o rcu_migrate_callbacks(), which does the following, thus properly
> > awakening the rcu_preempt task when needed:
> >
> > needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> > rcu_advance_cbs(rsp, rnp_root, my_rdp);
> > rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> > WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> > !rcu_segcblist_n_cbs(&my_rdp->cblist));
> > raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> > if (needwake)
> > rcu_gp_kthread_wake(rsp);
> >
> > Now for __note_gp_changes():
> >
> > o note_gp_changes(), which does the following, thus properly
> > awakening the rcu_preempt task when needed:
> >
> > needwake = __note_gp_changes(rsp, rnp, rdp);
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > if (needwake)
> > rcu_gp_kthread_wake(rsp);
> >
> > o rcu_gp_init() which is only ever invoked by RCU grace-period
> > kthreads such as the rcu_preempt task, which makes wakeups
> > unnecessary, just as for rcu_gp_cleanup() above.
> >
> > o rcu_gp_cleanup(), ditto.
> >
> > So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> > lockdep complaint:
> > ------------------------------------------------------------------------
> >
> > [ 2.895507] ======================================================
> > [ 2.895511] WARNING: possible circular locking dependency detected
> > [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> > [ 2.895521] ------------------------------------------------------
> > [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> > [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895546]
> > [ 2.895546] but task is already holding lock:
> > [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > [ 2.895561]
> > [ 2.895561] which lock already depends on the new lock.
> > [ 2.895561]
> > [ 2.895566]
> > [ 2.895566] the existing dependency chain (in reverse order) is:
> > [ 2.895570]
> > [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> > [ 2.895583] __mutex_lock+0x80/0x9a0
> > [ 2.895588] mutex_lock_nested+0x1b/0x20
> > [ 2.895593] media_device_register_entity+0x92/0x1e0
> > [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> > [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> > [ 2.895608] isys_probe+0x6cb/0xed0
> > [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> > [ 2.895620] really_probe+0x268/0x3d0
> > [ 2.895625] driver_probe_device+0x11a/0x130
> > [ 2.895630] __device_attach_driver+0x86/0x100
> > [ 2.895635] bus_for_each_drv+0x6e/0xb0
> > [ 2.895640] __device_attach+0xdf/0x160
> > [ 2.895645] device_initial_probe+0x13/0x20
> > [ 2.895650] bus_probe_device+0xa6/0xc0
> > [ 2.895655] deferred_probe_work_func+0x88/0xe0
> > [ 2.895661] process_one_work+0x220/0x5c0
> > [ 2.895665] worker_thread+0x1da/0x3b0
> > [ 2.895670] kthread+0x12c/0x150
> > [ 2.895675] ret_from_fork+0x3a/0x50
> > [ 2.895678]
> > [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> > [ 2.895688] lock_acquire+0x95/0x1a0
> > [ 2.895693] __mutex_lock+0x80/0x9a0
> > [ 2.895698] mutex_lock_nested+0x1b/0x20
> > [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> > [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> > [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> > [ 2.895728] csi2_link_validate+0xc6/0x220
> > [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> > [ 2.895738] media_pipeline_start+0x33/0x50
> > [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > [ 2.895748] start_streaming+0x186/0x3a0
> > [ 2.895753] vb2_start_streaming+0x6d/0x130
> > [ 2.895758] vb2_core_streamon+0x108/0x140
> > [ 2.895762] vb2_streamon+0x29/0x50
> > [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> > [ 2.895772] v4l_streamon+0x20/0x30
> > [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> > [ 2.895781] video_usercopy+0x27e/0x7e0
> > [ 2.895785] video_ioctl2+0x15/0x20
> > [ 2.895789] v4l2_ioctl+0x49/0x50
> > [ 2.895794] do_video_ioctl+0x93c/0x2360
> > [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> > [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> > [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> > [ 2.895821]
> > [ 2.895821] other info that might help us debug this:
> > [ 2.895821]
> > [ 2.895826] Possible unsafe locking scenario:
> > [ 2.895826]
> > [ 2.895830] CPU0 CPU1
> > [ 2.895833] ---- ----
> > [ 2.895836] lock(&mdev->graph_mutex);
> > [ 2.895842] lock(&asd->mutex);
> > [ 2.895847] lock(&mdev->graph_mutex);
> > [ 2.895852] lock(&asd->mutex);
> > [ 2.895857]
> > [ 2.895857] *** DEADLOCK ***
> > [ 2.895857]
> > [ 2.895863] 3 locks held by earlyEvs/1839:
> > [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> > [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> > [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > [ 2.895896]
> > [ 2.895896] stack backtrace:
> > [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> > [ 2.895907] Call Trace:
> > [ 2.895915] dump_stack+0x70/0xa5
> > [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> > [ 2.895927] __lock_acquire+0x1284/0x1340
> > [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> > [ 2.895940] lock_acquire+0x95/0x1a0
> > [ 2.895945] ? lock_acquire+0x95/0x1a0
> > [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895961] __mutex_lock+0x80/0x9a0
> > [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895971] ? crlmodule_get_format+0x43/0x50
> > [ 2.895979] mutex_lock_nested+0x1b/0x20
> > [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> > [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> > [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> > [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> > [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> > [ 2.896018] ? find_held_lock+0x35/0xa0
> > [ 2.896023] ? crlmodule_get_format+0x43/0x50
> > [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> > [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> > [ 2.896042] ? mutex_unlock+0x12/0x20
> > [ 2.896046] ? crlmodule_get_format+0x43/0x50
> > [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> > [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> > [ 2.896070] ? kernel_text_address+0xc4/0x100
> > [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> > [ 2.896085] ? kernel_text_address+0xc4/0x100
> > [ 2.896092] ? __lock_acquire+0x1106/0x1340
> > [ 2.896096] ? __lock_acquire+0x1169/0x1340
> > [ 2.896103] csi2_link_validate+0xc6/0x220
> > [ 2.896110] ? __lock_is_held+0x5a/0xa0
> > [ 2.896115] ? mark_held_locks+0x58/0x80
> > [ 2.896122] ? __kmalloc+0x207/0x2e0
> > [ 2.896127] ? __lock_is_held+0x5a/0xa0
> > [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> > [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> > [ 2.896144] ? media_pipeline_start+0x28/0x50
> > [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> > [ 2.896155] ? csi2_has_route+0x18/0x20
> > [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> > [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> > [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> > [ 2.896179] media_pipeline_start+0x33/0x50
> > [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > [ 2.896191] ? __lock_acquire+0x132e/0x1340
> > [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> > [ 2.896204] ? lock_acquire+0x95/0x1a0
> > [ 2.896209] ? start_streaming+0x5c/0x3a0
> > [ 2.896215] ? start_streaming+0x5c/0x3a0
> > [ 2.896221] ? __mutex_lock+0x391/0x9a0
> > [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> > [ 2.896233] ? find_held_lock+0x35/0xa0
> > [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> > [ 2.896245] start_streaming+0x186/0x3a0
> > [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> > [ 2.896257] vb2_start_streaming+0x6d/0x130
> > [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> > [ 2.896267] vb2_core_streamon+0x108/0x140
> > [ 2.896273] vb2_streamon+0x29/0x50
> > [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> > [ 2.896284] v4l_streamon+0x20/0x30
> > [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> > [ 2.896296] ? __might_fault+0x85/0x90
> > [ 2.896302] video_usercopy+0x27e/0x7e0
> > [ 2.896307] ? copy_overflow+0x20/0x20
> > [ 2.896313] ? find_held_lock+0x35/0xa0
> > [ 2.896319] ? __might_fault+0x3e/0x90
> > [ 2.896325] video_ioctl2+0x15/0x20
> > [ 2.896330] v4l2_ioctl+0x49/0x50
> > [ 2.896335] do_video_ioctl+0x93c/0x2360
> > [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> > [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> > [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> > [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> > [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> > [ 2.896377] RIP: 0023:0xf7e79b79
> > [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> > [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> > [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> > [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> > [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> > [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >
> > ------------------------------------------------------------------------
> >
> > > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> > >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Thursday, December 13, 2018 12:40 PM
> > > To: Zhang, Jun <[email protected]>
> > > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Xiao, Jin
> > > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai,
> > > Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > > Ok, we will test it, thanks!
> > >
> > > But please also try the sysrq-y with the earlier patch after a hang!
> > >
> > > Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > Sent: Thursday, December 13, 2018 10:43
> > > > To: Zhang, Jun <[email protected]>
> > > > Cc: He, Bo <[email protected]>; Steven Rostedt
> > > > <[email protected]>; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > > Sun, Yi J <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > > Hello, Paul
> > > > >
> > > > > I think the next patch is better.
> > > > > Because ULONG_CMP_GE could cause a double write, which carries the risk of writing back an old value.
> > > > > Please help review.
> > > > > I have not tested it. If you agree, we will test it.
> > > >
> > > > Just to make sure that I understand, you are worried about something like the following, correct?
> > > >
> > > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > > and finds them equal.
> > > >
> > > > o At just this time something like rcu_start_this_gp() assigns a new
> > > > (larger) value to rdp->gp_seq_needed.
> > > >
> > > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > > old value.
> > > >
> > > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > > >
> > > > Of course, there could be bugs. For example:
> > > >
> > > > o __note_gp_changes() might be called on a different CPU than that
> > > > corresponding to rdp. You can check this with something like:
> > > >
> > > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > > >
> > > > o The same things could happen with rcu_start_this_gp(), and the
> > > > above WARN_ON_ONCE() would work there as well.
> > > >
> > > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > > >
> > > > o Interrupts might not really be disabled.
> > > >
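For illustration, a minimal sketch of where the suggested check could be dropped in (hypothetical placement at the top of __note_gp_changes(); a debug aid for the first item above, not a fix):

	static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
				      struct rcu_data *rdp)
	{
		/* Catch the cross-CPU case called out above. */
		WARN_ON_ONCE(rdp->cpu != smp_processor_id());
		/* ... existing body unchanged ... */
	}
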
> > > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > > >
> > > > So why not just test it anyway? If it makes the bug go away, I
> > > > will be surprised, but it would not be the first surprise for me.
> > > > ;-)
> > > >
> > > > Thanx, Paul
> > > >
> > > > > Thanks!
> > > > >
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 0b760c1..c00f34e 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > >  		zero_cpu_stall_ticks(rdp);
> > > > >  	}
> > > > >  	rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > > > -	if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > > +	if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > > >  		rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > >  	WRITE_ONCE(rdp->gpwrap, false);
> > > > >  	rcu_gpnum_ovf(rnp, rdp);
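For reference, the wrap-safe comparison macros involved (as defined in kernel/rcu/rcu.h in v4.19) make Jun's concern concrete: ULONG_CMP_GE() is also true when the two sequence numbers are equal, so the original condition rewrites rdp->gp_seq_needed even when nothing changed, while ULONG_CMP_LT() stores only when rnp's value is strictly newer:

	/* Wrap-tolerant sequence comparisons (kernel/rcu/rcu.h, v4.19). */
	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
	#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
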
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > > Sent: Thursday, December 13, 2018 08:12
> > > > > To: He, Bo <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Zhang,
> > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > <[email protected]>; Sun, Yi J <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6 and do not see sysrq_rcu there either.
> > > > > > Please correct me if I have something wrong.
> > > > >
> > > > > That would be because I sent you the wrong patch, apologies!
> > > > > :-/
> > > > >
> > > > > Please instead see the one below, which does add sysrq_rcu.
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney <[email protected]>
> > > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > > To: He, Bo <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]; [email protected]; Zhang,
> > > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > > <[email protected]>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > > We reproduced on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as it was extracted from /proc/config.gz.
> > > > > > >
> > > > > > > Strange.
> > > > > > >
> > > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > > >
> > > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > > Author: Paul E. McKenney <[email protected]>
> > > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > > >
> > > > > rcu: Add sysrq rcu_node-dump capability
> > > > >
> > > > > Backported from v4.21/v5.0
> > > > >
> > > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > > for failing to start a grace period. This commit therefore adds a
> > > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > > must be set for this sysrq to be available.
> > > > >
> > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -61,6 +61,7 @@
> > > > >  #include <linux/trace_events.h>
> > > > >  #include <linux/suspend.h>
> > > > >  #include <linux/ftrace.h>
> > > > > +#include <linux/sysrq.h>
> > > > >
> > > > >  #include "tree.h"
> > > > >  #include "rcu.h"
> > > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > >  /* panic() on RCU Stall sysctl. */
> > > > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > > +static bool sysrq_rcu;
> > > > > +module_param(sysrq_rcu, bool, 0444);
> > > > >
> > > > >  /*
> > > > >   * The rcu_scheduler_active variable is initialized to the value
> > > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > > >
> > > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > > +static void sysrq_show_rcu(int key)
> > > > > +{
> > > > > +	show_rcu_gp_kthreads();
> > > > > +}
> > > > > +
> > > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > > +	.handler = sysrq_show_rcu,
> > > > > +	.help_msg = "show-rcu(y)",
> > > > > +	.action_msg = "Show RCU tree",
> > > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > > +};
> > > > > +
> > > > > +static int __init rcu_sysrq_init(void)
> > > > > +{
> > > > > +	if (sysrq_rcu)
> > > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > > +	return 0;
> > > > > +}
> > > > > +early_initcall(rcu_sysrq_init);
> > > > > +
> > > > >  /*
> > > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > > >   */
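For reference, once a kernel carrying this patch is booted with rcutree.sysrq_rcu=1, the dump can be triggered either with the keyboard sysrq-y or, from a shell, by writing "y" to /proc/sysrq-trigger (standard sysrq behavior).
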
> > > > >
> > > >
> > >
> >
> >
>



2018-12-14 05:40:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 09:10:12PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 14, 2018 at 02:40:50AM +0000, He, Bo wrote:
> > We have done another experiment with the enclosed debug patch, also with more rcu trace events enabled but without the CONFIG_RCU_BOOST config; we have not reproduced the issue for more than 90 hours so far on 10 boards (per previous experience the issue should reproduce within one night).
>
> That certainly supports the hypothesis that a wakeup is either not
> being sent or is being lost. Your patch is great for debugging (thank
> you!), but the real solution of course needs to avoid the extra wakeups,
> especially on battery-powered systems.
>
> One suggested change below, to get rid of potential false positives.
>
> > The purpose is to capture more rcu event traces close to when the issue happens. Because I see that __wait_rcu_gp is not always running, we think that even when it triggers the panic on the 3s timeout, the issue had already happened before those 3s.
>
> Agreed, it would be really good to have trace information from the cause.
> In the case you sent yesterday, it would be good to have trace information
> from 308.256 seconds prior to the sysrq-v, for example, by collecting the
> same event traces you did a few days ago. It would also be good to know
> whether the scheduler tick is providing interrupts, and if so, why
> rcu_check_gp_start_stall() isn't being invoked. ;-)
>
> If collecting this information with your setup is not feasible (for
> example, you might need a large trace buffer to capture five minutes
> of traces), please let me know and I can provide additional debug
> code. Or you could add "rcu_ftrace_dump(DUMP_ALL);" just before the
> "show_rcu_gp_kthreads();" in your patch below.
>
> > And actually rsp->gp_flags = 1, but RCU_GP_WAIT_GPS(1) ->state: 0x402 means the kthread has not been scheduled for 300s even though RCU_GP_FLAG_INIT is set. What are your ideas?
>
> The most likely possibility is that my analysis below is confused and
> there really is some way that the code can set the RCU_GP_FLAG_INIT
> bit without later doing a wakeup. The trace data above could help
> unconfuse me.
>
> Thanx, Paul
>
> > ---------------------------------------------------------------------------------------------------------------------------------
> > -	swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > -				   RCU_GP_FLAG_INIT);
> > +	if (current->pid != rcu_preempt_pid) {
> > +		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > +					   RCU_GP_FLAG_INIT);
> > +	} else {
>
> wait_again:
>
> > +		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > +							 RCU_GP_FLAG_INIT, 2*HZ);
> > +
> > +		if (!ret) {
>
> This would avoid complaining if RCU was legitimately idle for a long time:

Let's try this again. Unless I am confused (quite possible), your original
would panic if RCU was idle for more than two seconds. What we instead
want is to panic if we time out but end up with RCU_GP_FLAG_INIT set.

So something like this:

	if (ret == 1) {
		/* Timed out with RCU_GP_FLAG_INIT. */
		rcu_ftrace_dump(DUMP_ALL);
		show_rcu_gp_kthreads();
		panic("hung_task: blocked in rcu_gp_kthread init");
	} else if (!ret) {
		/* Timed out w/out RCU_GP_FLAG_INIT. */
		goto wait_again;
	}

Thanx, Paul

> > +			show_rcu_gp_kthreads();
> > +			panic("hung_task: blocked in rcu_gp_kthread init");
> > +		}
> > +	}
> > --------------------------------------------------------------------------------------
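Putting the quoted debug patch together with the corrected retry logic above gives, roughly, the following shape for the modified rcu_gp_kthread() wait (a sketch only: rcu_preempt_pid and the 2*HZ timeout come from the debug patch, a local long ret is assumed, and this has not been tested):

	if (current->pid != rcu_preempt_pid) {
		swait_event_idle_exclusive(rsp->gp_wq,
				READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT);
	} else {
wait_again:
		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq,
				READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT,
				2 * HZ);
		if (ret == 1) {
			/* Timed out, yet RCU_GP_FLAG_INIT was set: the
			 * expected wakeup never arrived. */
			rcu_ftrace_dump(DUMP_ALL);
			show_rcu_gp_kthreads();
			panic("hung_task: blocked in rcu_gp_kthread init");
		} else if (!ret) {
			/* Timed out with no grace period requested:
			 * RCU was legitimately idle, so keep waiting. */
			goto wait_again;
		}
		/* ret > 1: normal wakeup with RCU_GP_FLAG_INIT set. */
	}
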
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Friday, December 14, 2018 10:15 AM
> > To: He, Bo <[email protected]>
> > Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> > > As you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked and there is no FAST_NO_HZ in .config:
> >
> > Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set, which means several code paths can be ignored. Also CONFIG_HZ=1000, so
> > 300 second delay.
> >
> > Thanx, Paul
> >
> > > Here is the grep from .config:
> > > egrep "HZ|RCU" .config
> > > CONFIG_NO_HZ_COMMON=y
> > > # CONFIG_HZ_PERIODIC is not set
> > > CONFIG_NO_HZ_IDLE=y
> > > # CONFIG_NO_HZ_FULL is not set
> > > CONFIG_NO_HZ=y
> > > # RCU Subsystem
> > > CONFIG_PREEMPT_RCU=y
> > > # CONFIG_RCU_EXPERT is not set
> > > CONFIG_SRCU=y
> > > CONFIG_TREE_SRCU=y
> > > CONFIG_TASKS_RCU=y
> > > CONFIG_RCU_STALL_COMMON=y
> > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > # CONFIG_HZ_100 is not set
> > > # CONFIG_HZ_250 is not set
> > > # CONFIG_HZ_300 is not set
> > > CONFIG_HZ_1000=y
> > > CONFIG_HZ=1000
> > > # CONFIG_MACHZ_WDT is not set
> > > # RCU Debugging
> > > CONFIG_PROVE_RCU=y
> > > CONFIG_RCU_PERF_TEST=m
> > > CONFIG_RCU_TORTURE_TEST=m
> > > CONFIG_RCU_CPU_STALL_TIMEOUT=7
> > > CONFIG_RCU_TRACE=y
> > > CONFIG_RCU_EQS_DEBUG=y
> > >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Friday, December 14, 2018 2:12 AM
> > > To: He, Bo <[email protected]>
> > > Cc: Zhang, Jun <[email protected]>; Steven Rostedt
> > > <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J
> > > <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > > > One of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
> > > >
> > > > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
> > >
> > > This is quite helpful, thank you!
> > >
> > > The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> > > The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
> > >
> > > The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> > > The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
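As an aside for readers decoding these dumps: ->state 0x402 is TASK_UNINTERRUPTIBLE (0x0002) | TASK_NOLOAD (0x0400), which together form TASK_IDLE, and the grace-period sequence numbers carry two low-order state bits. A sketch of the v4.19 helpers (from kernel/rcu/rcu.h) that decode them:

	#define RCU_SEQ_CTR_SHIFT	2
	#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

	/* Grace-period count, i.e., the sequence number sans state bits. */
	static inline unsigned long rcu_seq_ctr(unsigned long s)
	{
		return s >> RCU_SEQ_CTR_SHIFT;
	}

	/* Bottom bits: zero when idle, nonzero while a grace period runs. */
	static inline int rcu_seq_state(unsigned long s)
	{
		return s & RCU_SEQ_STATE_MASK;
	}

So ->gp_seq 21808192 versus ->gp_seq_needed 21808196 is a difference of 4, i.e., exactly one grace period (1 << RCU_SEQ_CTR_SHIFT) has been requested and has not started.
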
> > >
> > > But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
> > >
> > > 1. Because rcu_check_gp_start_stall() is only ever invoked from
> > > RCU_SOFTIRQ, it is possible that softirqs are stalled for
> > > whatever reason.
> > >
> > > 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> > > interrupt handler, it is possible that the scheduler tick has
> > > somehow been disabled. Traces from earlier runs showed a great
> > > deal of RCU callbacks queued, which would have caused RCU to
> > > refuse to allow the scheduler tick to be disabled, even if the
> > > corresponding CPU was idle.
> > >
> > > 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> > > that you are building for a battery-powered device) and all of the
> > > CPU's callbacks are lazy. Except that your earlier traces showed
> > > lots of non-lazy callbacks. Besides, even if all callbacks were
> > > lazy, there would still be a scheduling-clock interrupt every
> > > six seconds, and there are quite a few six-second intervals
> > > in a two-minute watchdog timeout.
> > >
> > > But if we cannot find the problem quickly, I will likely ask
> > > you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> > > be thought of as bisecting the RCU code looking for the bug.
> > >
> > > The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
> > > One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to no-inline rcu_check_gp_start_stall(). You might have better ways to collect this information.
> > >
> > > Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
> > >
> > > You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
> > >
> > > Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> > > I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
> > >
> > > The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> > > Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
> > >
> > > The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
> > >
> > > o rcu_accelerate_cbs_unlocked(), which does the following, thus
> > > properly awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_advance_cbs(), which returns the value returned by
> > > rcu_accelerate_cbs(), thus pushing the problem off to its
> > > callers, which are called out below.
> > >
> > > o __note_gp_changes(), which also returns the value returned by
> > > rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> > > which are called out below.
> > >
> > > o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> > > kthreads such as the rcu_preempt task. Therefore, this function
> > > never needs to awaken the rcu_preempt task, because the fact
> > > that this function is executing means that this task is already
> > > awake. (Also, as noted above, we can eliminate this code from
> > > consideration because this task is known to have been sleeping
> > > at the last time that the RCU_GP_FLAG_INIT bit was set.)
> > >
> > > o rcu_report_qs_rdp(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > >
> > > rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> > > /* ^^^ Released rnp->lock */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_prepare_for_idle(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > Now for rcu_advance_cbs():
> > >
> > > o __note_gp_changes(), which also returns the value returned
> > > by rcu_advance_cbs(), thus pushing the problem off to its callers,
> > > which are called out below.
> > >
> > > o rcu_migrate_callbacks(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> > > rcu_advance_cbs(rsp, rnp_root, my_rdp);
> > > rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> > > WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> > > !rcu_segcblist_n_cbs(&my_rdp->cblist));
> > > raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > Now for __note_gp_changes():
> > >
> > > o note_gp_changes(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = __note_gp_changes(rsp, rnp, rdp);
> > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_gp_init() which is only ever invoked by RCU grace-period
> > > kthreads such as the rcu_preempt task, which makes wakeups
> > > unnecessary, just as for rcu_gp_cleanup() above.
> > >
> > > o rcu_gp_cleanup(), ditto.
> > >
> > > So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > > lockdep complaint:
> > > ------------------------------------------------------------------------
> > >
> > > [ 2.895507] ======================================================
> > > [ 2.895511] WARNING: possible circular locking dependency detected
> > > [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> > > [ 2.895521] ------------------------------------------------------
> > > [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> > > [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895546]
> > > [ 2.895546] but task is already holding lock:
> > > [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > > [ 2.895561]
> > > [ 2.895561] which lock already depends on the new lock.
> > > [ 2.895561]
> > > [ 2.895566]
> > > [ 2.895566] the existing dependency chain (in reverse order) is:
> > > [ 2.895570]
> > > [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> > > [ 2.895583] __mutex_lock+0x80/0x9a0
> > > [ 2.895588] mutex_lock_nested+0x1b/0x20
> > > [ 2.895593] media_device_register_entity+0x92/0x1e0
> > > [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> > > [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> > > [ 2.895608] isys_probe+0x6cb/0xed0
> > > [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> > > [ 2.895620] really_probe+0x268/0x3d0
> > > [ 2.895625] driver_probe_device+0x11a/0x130
> > > [ 2.895630] __device_attach_driver+0x86/0x100
> > > [ 2.895635] bus_for_each_drv+0x6e/0xb0
> > > [ 2.895640] __device_attach+0xdf/0x160
> > > [ 2.895645] device_initial_probe+0x13/0x20
> > > [ 2.895650] bus_probe_device+0xa6/0xc0
> > > [ 2.895655] deferred_probe_work_func+0x88/0xe0
> > > [ 2.895661] process_one_work+0x220/0x5c0
> > > [ 2.895665] worker_thread+0x1da/0x3b0
> > > [ 2.895670] kthread+0x12c/0x150
> > > [ 2.895675] ret_from_fork+0x3a/0x50
> > > [ 2.895678]
> > > [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> > > [ 2.895688] lock_acquire+0x95/0x1a0
> > > [ 2.895693] __mutex_lock+0x80/0x9a0
> > > [ 2.895698] mutex_lock_nested+0x1b/0x20
> > > [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> > > [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.895728] csi2_link_validate+0xc6/0x220
> > > [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> > > [ 2.895738] media_pipeline_start+0x33/0x50
> > > [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > > [ 2.895748] start_streaming+0x186/0x3a0
> > > [ 2.895753] vb2_start_streaming+0x6d/0x130
> > > [ 2.895758] vb2_core_streamon+0x108/0x140
> > > [ 2.895762] vb2_streamon+0x29/0x50
> > > [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> > > [ 2.895772] v4l_streamon+0x20/0x30
> > > [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> > > [ 2.895781] video_usercopy+0x27e/0x7e0
> > > [ 2.895785] video_ioctl2+0x15/0x20
> > > [ 2.895789] v4l2_ioctl+0x49/0x50
> > > [ 2.895794] do_video_ioctl+0x93c/0x2360
> > > [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> > > [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > > [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> > > [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> > > [ 2.895821]
> > > [ 2.895821] other info that might help us debug this:
> > > [ 2.895821]
> > > [ 2.895826] Possible unsafe locking scenario:
> > > [ 2.895826]
> > > [ 2.895830] CPU0 CPU1
> > > [ 2.895833] ---- ----
> > > [ 2.895836] lock(&mdev->graph_mutex);
> > > [ 2.895842] lock(&asd->mutex);
> > > [ 2.895847] lock(&mdev->graph_mutex);
> > > [ 2.895852] lock(&asd->mutex);
> > > [ 2.895857]
> > > [ 2.895857] *** DEADLOCK ***
> > > [ 2.895857]
> > > [ 2.895863] 3 locks held by earlyEvs/1839:
> > > [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> > > [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> > > [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > > [ 2.895896]
> > > [ 2.895896] stack backtrace:
> > > [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> > > [ 2.895907] Call Trace:
> > > [ 2.895915] dump_stack+0x70/0xa5
> > > [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> > > [ 2.895927] __lock_acquire+0x1284/0x1340
> > > [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> > > [ 2.895940] lock_acquire+0x95/0x1a0
> > > [ 2.895945] ? lock_acquire+0x95/0x1a0
> > > [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895961] __mutex_lock+0x80/0x9a0
> > > [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895971] ? crlmodule_get_format+0x43/0x50
> > > [ 2.895979] mutex_lock_nested+0x1b/0x20
> > > [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> > > [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> > > [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> > > [ 2.896018] ? find_held_lock+0x35/0xa0
> > > [ 2.896023] ? crlmodule_get_format+0x43/0x50
> > > [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> > > [ 2.896042] ? mutex_unlock+0x12/0x20
> > > [ 2.896046] ? crlmodule_get_format+0x43/0x50
> > > [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> > > [ 2.896070] ? kernel_text_address+0xc4/0x100
> > > [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.896085] ? kernel_text_address+0xc4/0x100
> > > [ 2.896092] ? __lock_acquire+0x1106/0x1340
> > > [ 2.896096] ? __lock_acquire+0x1169/0x1340
> > > [ 2.896103] csi2_link_validate+0xc6/0x220
> > > [ 2.896110] ? __lock_is_held+0x5a/0xa0
> > > [ 2.896115] ? mark_held_locks+0x58/0x80
> > > [ 2.896122] ? __kmalloc+0x207/0x2e0
> > > [ 2.896127] ? __lock_is_held+0x5a/0xa0
> > > [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> > > [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> > > [ 2.896144] ? media_pipeline_start+0x28/0x50
> > > [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> > > [ 2.896155] ? csi2_has_route+0x18/0x20
> > > [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> > > [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> > > [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> > > [ 2.896179] media_pipeline_start+0x33/0x50
> > > [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > > [ 2.896191] ? __lock_acquire+0x132e/0x1340
> > > [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> > > [ 2.896204] ? lock_acquire+0x95/0x1a0
> > > [ 2.896209] ? start_streaming+0x5c/0x3a0
> > > [ 2.896215] ? start_streaming+0x5c/0x3a0
> > > [ 2.896221] ? __mutex_lock+0x391/0x9a0
> > > [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> > > [ 2.896233] ? find_held_lock+0x35/0xa0
> > > [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> > > [ 2.896245] start_streaming+0x186/0x3a0
> > > [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> > > [ 2.896257] vb2_start_streaming+0x6d/0x130
> > > [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> > > [ 2.896267] vb2_core_streamon+0x108/0x140
> > > [ 2.896273] vb2_streamon+0x29/0x50
> > > [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> > > [ 2.896284] v4l_streamon+0x20/0x30
> > > [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> > > [ 2.896296] ? __might_fault+0x85/0x90
> > > [ 2.896302] video_usercopy+0x27e/0x7e0
> > > [ 2.896307] ? copy_overflow+0x20/0x20
> > > [ 2.896313] ? find_held_lock+0x35/0xa0
> > > [ 2.896319] ? __might_fault+0x3e/0x90
> > > [ 2.896325] video_ioctl2+0x15/0x20
> > > [ 2.896330] v4l2_ioctl+0x49/0x50
> > > [ 2.896335] do_video_ioctl+0x93c/0x2360
> > > [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> > > [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > > [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> > > [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> > > [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> > > [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> > > [ 2.896377] RIP: 0023:0xf7e79b79
> > > [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> > > [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> > > [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> > > [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> > > [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> > > [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > >
> > > ------------------------------------------------------------------------
> > >
> > > > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > > > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > > > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> > > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Thursday, December 13, 2018 12:40 PM
> > > > To: Zhang, Jun <[email protected]>
> > > > Cc: He, Bo <[email protected]>; Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Xiao, Jin
> > > > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai,
> > > > Jie A <[email protected]>; Sun, Yi J <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > > > Ok, we will test it, thanks!
> > > >
> > > > But please also try the sysrq-y with the earlier patch after a hang!
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > > Sent: Thursday, December 13, 2018 10:43
> > > > > To: Zhang, Jun <[email protected]>
> > > > > Cc: He, Bo <[email protected]>; Steven Rostedt
> > > > > <[email protected]>; [email protected];
> > > > > [email protected]; [email protected];
> > > > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > > > Sun, Yi J <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > > > Hello, Paul
> > > > > >
> > > > > > I think the next patch is better.
> > > > > > Because ULONG_CMP_GE could cause a double write, which carries the risk of writing back an old value.
> > > > > > Please help review.
> > > > > > I have not tested it. If you agree, we will test it.
> > > > >
> > > > > Just to make sure that I understand, you are worried about something like the following, correct?
> > > > >
> > > > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > > > and finds them equal.
> > > > >
> > > > > o At just this time something like rcu_start_this_gp() assigns a new
> > > > > (larger) value to rdp->gp_seq_needed.
> > > > >
> > > > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > > > old value.
> > > > >
> > > > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > > > >
> > > > > Of course, there could be bugs. For example:
> > > > >
> > > > > o __note_gp_changes() might be called on a different CPU than that
> > > > > corresponding to rdp. You can check this with something like:
> > > > >
> > > > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > > > >
> > > > > o The same things could happen with rcu_start_this_gp(), and the
> > > > > above WARN_ON_ONCE() would work there as well.
> > > > >
> > > > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > > > >
> > > > > o Interrupts might not really be disabled.
> > > > >
> > > > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > > > >
> > > > > So why not just test it anyway? If it makes the bug go away, I
> > > > > will be surprised, but it would not be the first surprise for me.
> > > > > ;-)
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > Thanks!
> > > > > >
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 0b760c1..c00f34e 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > > >  		zero_cpu_stall_ticks(rdp);
> > > > > >  	}
> > > > > >  	rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > > > > -	if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > > > +	if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > > > >  		rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > > >  	WRITE_ONCE(rdp->gpwrap, false);
> > > > > >  	rcu_gpnum_ovf(rnp, rdp);
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > > > Sent: Thursday, December 13, 2018 08:12
> > > > > > To: He, Bo <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]; [email protected]; Zhang,
> > > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > > <[email protected]>; Sun, Yi J <[email protected]>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6 and do not see sysrq_rcu there either.
> > > > > > > Please correct me if I have something wrong.
> > > > > >
> > > > > > That would be because I sent you the wrong patch, apologies!
> > > > > > :-/
> > > > > >
> > > > > > Please instead see the one below, which does add sysrq_rcu.
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Paul E. McKenney <[email protected]>
> > > > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > > > To: He, Bo <[email protected]>
> > > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > > [email protected]; [email protected];
> > > > > > > [email protected]; [email protected]; Zhang,
> > > > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > > > Zhang, Yanmin <[email protected]>; Bai, Jie A
> > > > > > > <[email protected]>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > > > We reproduced on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as it was extracted from /proc/config.gz.
> > > > > > > >
> > > > > > > > Strange.
> > > > > > > >
> > > > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > > > >
> > > > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > > >
> > > > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > > > Author: Paul E. McKenney <[email protected]>
> > > > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > > > >
> > > > > > rcu: Add sysrq rcu_node-dump capability
> > > > > >
> > > > > > Backported from v4.21/v5.0
> > > > > >
> > > > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > > > for failing to start a grace period. This commit therefore adds a
> > > > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > > > must be set for this sysrq to be available.
> > > > > >
> > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -61,6 +61,7 @@
> > > > > >  #include <linux/trace_events.h>
> > > > > >  #include <linux/suspend.h>
> > > > > >  #include <linux/ftrace.h>
> > > > > > +#include <linux/sysrq.h>
> > > > > >
> > > > > >  #include "tree.h"
> > > > > >  #include "rcu.h"
> > > > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > > >  /* panic() on RCU Stall sysctl. */
> > > > > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > > > +static bool sysrq_rcu;
> > > > > > +module_param(sysrq_rcu, bool, 0444);
> > > > > >
> > > > > >  /*
> > > > > >   * The rcu_scheduler_active variable is initialized to the value
> > > > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > > > >
> > > > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > > > +static void sysrq_show_rcu(int key)
> > > > > > +{
> > > > > > +	show_rcu_gp_kthreads();
> > > > > > +}
> > > > > > +
> > > > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > > > +	.handler = sysrq_show_rcu,
> > > > > > +	.help_msg = "show-rcu(y)",
> > > > > > +	.action_msg = "Show RCU tree",
> > > > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > > > +};
> > > > > > +
> > > > > > +static int __init rcu_sysrq_init(void)
> > > > > > +{
> > > > > > +	if (sysrq_rcu)
> > > > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +early_initcall(rcu_sysrq_init);
> > > > > > +
> > > > > >  /*
> > > > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > > > >   */
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>


2018-12-17 04:05:37

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

To double-confirm that the issue stops reproducing (it had not reproduced after 90 hours), we tried adding only the enclosed patch on top of the build that reproduces easily; the issue did not reproduce for 63 hours over the whole weekend on 16 boards.
So the current conclusion is that the debug patch has a decisive effect on the rcu issue.

For comparison: swait_event_idle_timeout_exclusive() checks the condition three times, while swait_event_idle_exclusive() checks it only twice.
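(Roughly, and glossing over the swait-queue bookkeeping, the two macros differ as sketched below; this is a simplification of the v4.19 include/linux/swait.h expansions, not the literal code:)

	/* swait_event_idle_exclusive(wq, cond): check, sleep, re-check. */
	while (!cond)
		schedule();

	/*
	 * swait_event_idle_timeout_exclusive(wq, cond, timeout): as above,
	 * but cond is also re-evaluated once more when the timeout expires;
	 * returns 0 (timed out, cond false), 1 (timed out, cond true), or
	 * the remaining jiffies (woken with cond true before the timeout).
	 */
	while (!cond && timeout)
		timeout = schedule_timeout(timeout);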

So today I will run another experiment with only the change below:
-	swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-				   RCU_GP_FLAG_INIT);
+	ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+						 RCU_GP_FLAG_INIT, MAX_SCHEDULE_TIMEOUT);
+

Can you get any clues from this experiment?

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Friday, December 14, 2018 1:39 PM
To: He, Bo <[email protected]>
Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 09:10:12PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 14, 2018 at 02:40:50AM +0000, He, Bo wrote:
> > We have done another experiment with the enclosed debug patch, also with more rcu trace events enabled but without the CONFIG_RCU_BOOST config; we have not reproduced the issue for more than 90 hours so far on 10 boards (per previous experience the issue should reproduce within one night).
>
> That certainly supports the hypothesis that a wakeup is either not
> being sent or is being lost. Your patch is great for debugging (thank
> you!), but the real solution of course needs to avoid the extra
> wakeups, especially on battery-powered systems.
>
> One suggested change below, to get rid of potential false positives.
>
> > The purpose is to capture more rcu event traces close to when the issue happens. Because I see that __wait_rcu_gp is not always running, we think that even when it triggers the panic on the 3s timeout, the issue had already happened before those 3s.
>
> Agreed, it would be really good to have trace information from the cause.
> In the case you sent yesterday, it would be good to have trace
> information from 308.256 seconds prior to the sysrq-v, for example, by
> collecting the same event traces you did a few days ago. It would
> also be good to know whether the scheduler tick is providing
> interrupts, and if so, why
> rcu_check_gp_start_stall() isn't being invoked. ;-)
>
> If collecting this information with your setup is not feasible (for
> example, you might need a large trace buffer to capture five minutes
> of traces), please let me know and I can provide additional debug
> code. Or you could add "rcu_ftrace_dump(DUMP_ALL);" just before the
> "show_rcu_gp_kthreads();" in your patch below.
>
> > And actually rsp->gp_flags = 1, but RCU_GP_WAIT_GPS(1) ->state: 0x402 means the kthread has not been scheduled for 300s even though RCU_GP_FLAG_INIT is set. What are your ideas?
>
> The most likely possibility is that my analysis below is confused and
> there really is some way that the code can set the RCU_GP_FLAG_INIT
> bit without later doing a wakeup. The trace data above could help
> unconfuse me.
>
> Thanx, Paul
>
> > ---------------------------------------------------------------------------------------------------------------------------------
> > -	swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > -				   RCU_GP_FLAG_INIT);
> > +	if (current->pid != rcu_preempt_pid) {
> > +		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > +					   RCU_GP_FLAG_INIT);
> > +	} else {
>
> wait_again:
>
> > +		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > +							 RCU_GP_FLAG_INIT, 2*HZ);
> > +
> > +		if (!ret) {
>
> This would avoid complaining if RCU was legitimately idle for a long time:

Let's try this again. Unless I am confused (quite possible), your original would panic if RCU was idle for more than two seconds. What we instead want is to panic if we time out but end up with RCU_GP_FLAG_INIT set.

So something like this:

	if (ret == 1) {
		/* Timed out with RCU_GP_FLAG_INIT. */
		rcu_ftrace_dump(DUMP_ALL);
		show_rcu_gp_kthreads();
		panic("hung_task: blocked in rcu_gp_kthread init");
	} else if (!ret) {
		/* Timed out w/out RCU_GP_FLAG_INIT. */
		goto wait_again;
	}

Thanx, Paul

> > +			show_rcu_gp_kthreads();
> > +			panic("hung_task: blocked in rcu_gp_kthread init");
> > +		}
> > +	}
> > ------------------------------------------------------------------------
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Friday, December 14, 2018 10:15 AM
> > To: He, Bo <[email protected]>
> > Cc: Zhang, Jun <[email protected]>; Steven Rostedt
> > <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > Sun, Yi J <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> > > As you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked and there is no FAST_NO_HZ in .config:
> >
> > Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it
> > set, which means several code paths can be ignored. Also
> > CONFIG_HZ=1000, so
> > 300 second delay.
> >
> > Thanx, Paul
> >
> > > Here is the grep from .config:
> > > egrep "HZ|RCU" .config
> > > CONFIG_NO_HZ_COMMON=y
> > > # CONFIG_HZ_PERIODIC is not set
> > > CONFIG_NO_HZ_IDLE=y
> > > # CONFIG_NO_HZ_FULL is not set
> > > CONFIG_NO_HZ=y
> > > # RCU Subsystem
> > > CONFIG_PREEMPT_RCU=y
> > > # CONFIG_RCU_EXPERT is not set
> > > CONFIG_SRCU=y
> > > CONFIG_TREE_SRCU=y
> > > CONFIG_TASKS_RCU=y
> > > CONFIG_RCU_STALL_COMMON=y
> > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > # CONFIG_HZ_100 is not set
> > > # CONFIG_HZ_250 is not set
> > > # CONFIG_HZ_300 is not set
> > > CONFIG_HZ_1000=y
> > > CONFIG_HZ=1000
> > > # CONFIG_MACHZ_WDT is not set
> > > # RCU Debugging
> > > CONFIG_PROVE_RCU=y
> > > CONFIG_RCU_PERF_TEST=m
> > > CONFIG_RCU_TORTURE_TEST=m
> > > CONFIG_RCU_CPU_STALL_TIMEOUT=7
> > > CONFIG_RCU_TRACE=y
> > > CONFIG_RCU_EQS_DEBUG=y
> > >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Friday, December 14, 2018 2:12 AM
> > > To: He, Bo <[email protected]>
> > > Cc: Zhang, Jun <[email protected]>; Steven Rostedt
> > > <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > Yanmin <[email protected]>; Bai, Jie A <[email protected]>;
> > > Sun, Yi J <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > > > One of the boards reproduced the issue with show_rcu_gp_kthreads(); I also enclosed the logs as an attachment.
> > > >
> > > > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
> > >
> > > This is quite helpful, thank you!
> > >
> > > The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> > > The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
> > >
> > > The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> > > The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
> > >
> > > But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
> > >
> > > 1. Because rcu_check_gp_start_stall() is only ever invoked from
> > > RCU_SOFTIRQ, it is possible that softirqs are stalled for
> > > whatever reason.
> > >
> > > 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> > > interrupt handler, it is possible that the scheduler tick has
> > > somehow been disabled. Traces from earlier runs showed a great
> > > deal of RCU callbacks queued, which would have caused RCU to
> > > refuse to allow the scheduler tick to be disabled, even if the
> > > corresponding CPU was idle.
> > >
> > > 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> > > that you are building for a battery-powered device) and all of the
> > > CPU's callbacks are lazy. Except that your earlier traces showed
> > > lots of non-lazy callbacks. Besides, even if all callbacks were
> > > lazy, there would still be a scheduling-clock interrupt every
> > > six seconds, and there are quite a few six-second intervals
> > > in a two-minute watchdog timeout.
> > >
> > > But if we cannot find the problem quickly, I will likely ask
> > > you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> > > be thought of as bisecting the RCU code looking for the bug.
> > >
> > > The first two of these seem unlikely given that the watchdog timer was still firing. Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise. Could you please check?
> > > One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to no-inline rcu_check_gp_start_stall(). You might have better ways to collect this information.
> > >
> > > Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
> > >
> > > You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
> > >
> > > Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> > > I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
> > >
> > > The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> > > Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
> > >
> > > The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
> > >
> > > o rcu_accelerate_cbs_unlocked(), which does the following, thus
> > > properly awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_advance_cbs(), which returns the value returned by
> > > rcu_accelerate_cbs(), thus pushing the problem off to its
> > > callers, which are called out below.
> > >
> > > o __note_gp_changes(), which also returns the value returned by
> > > rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> > > which are called out below.
> > >
> > > o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> > > kthreads such as the rcu_preempt task. Therefore, this function
> > > never needs to awaken the rcu_preempt task, because the fact
> > > that this function is executing means that this task is already
> > > awake. (Also, as noted above, we can eliminate this code from
> > > consideration because this task is known to have been sleeping
> > > at the last time that the RCU_GP_FLAG_INIT bit was set.)
> > >
> > > o rcu_report_qs_rdp(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > >
> > > rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> > > /* ^^^ Released rnp->lock */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_prepare_for_idle(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > Now for rcu_advance_cbs():
> > >
> > > o __note_gp_changes(), which also returns the value returned
> > > by rcu_advance_cbs(), thus pushing the problem off to its callers,
> > > which are called out below.
> > >
> > > o rcu_migrate_callbacks(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> > > rcu_advance_cbs(rsp, rnp_root, my_rdp);
> > > rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> > > WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> > > !rcu_segcblist_n_cbs(&my_rdp->cblist));
> > > raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > Now for __note_gp_changes():
> > >
> > > o note_gp_changes(), which does the following, thus properly
> > > awakening the rcu_preempt task when needed:
> > >
> > > needwake = __note_gp_changes(rsp, rnp, rdp);
> > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > if (needwake)
> > > rcu_gp_kthread_wake(rsp);
> > >
> > > o rcu_gp_init(), which is only ever invoked by RCU grace-period
> > > kthreads such as the rcu_preempt task, which makes wakeups
> > > unnecessary, just as for rcu_gp_cleanup() above.
> > >
> > > o rcu_gp_cleanup(), ditto.
> > >
> > > So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
> > >
> > > Thanx, Paul
> > >
> > > ----------------------------------------------------------------------
> > > lockdep complaint:
> > > ----------------------------------------------------------------------
> > >
> > > [ 2.895507] ======================================================
> > > [ 2.895511] WARNING: possible circular locking dependency detected
> > > [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> > > [ 2.895521] ------------------------------------------------------
> > > [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> > > [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895546]
> > > [ 2.895546] but task is already holding lock:
> > > [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > > [ 2.895561]
> > > [ 2.895561] which lock already depends on the new lock.
> > > [ 2.895561]
> > > [ 2.895566]
> > > [ 2.895566] the existing dependency chain (in reverse order) is:
> > > [ 2.895570]
> > > [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> > > [ 2.895583] __mutex_lock+0x80/0x9a0
> > > [ 2.895588] mutex_lock_nested+0x1b/0x20
> > > [ 2.895593] media_device_register_entity+0x92/0x1e0
> > > [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> > > [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> > > [ 2.895608] isys_probe+0x6cb/0xed0
> > > [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> > > [ 2.895620] really_probe+0x268/0x3d0
> > > [ 2.895625] driver_probe_device+0x11a/0x130
> > > [ 2.895630] __device_attach_driver+0x86/0x100
> > > [ 2.895635] bus_for_each_drv+0x6e/0xb0
> > > [ 2.895640] __device_attach+0xdf/0x160
> > > [ 2.895645] device_initial_probe+0x13/0x20
> > > [ 2.895650] bus_probe_device+0xa6/0xc0
> > > [ 2.895655] deferred_probe_work_func+0x88/0xe0
> > > [ 2.895661] process_one_work+0x220/0x5c0
> > > [ 2.895665] worker_thread+0x1da/0x3b0
> > > [ 2.895670] kthread+0x12c/0x150
> > > [ 2.895675] ret_from_fork+0x3a/0x50
> > > [ 2.895678]
> > > [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> > > [ 2.895688] lock_acquire+0x95/0x1a0
> > > [ 2.895693] __mutex_lock+0x80/0x9a0
> > > [ 2.895698] mutex_lock_nested+0x1b/0x20
> > > [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> > > [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.895728] csi2_link_validate+0xc6/0x220
> > > [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> > > [ 2.895738] media_pipeline_start+0x33/0x50
> > > [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > > [ 2.895748] start_streaming+0x186/0x3a0
> > > [ 2.895753] vb2_start_streaming+0x6d/0x130
> > > [ 2.895758] vb2_core_streamon+0x108/0x140
> > > [ 2.895762] vb2_streamon+0x29/0x50
> > > [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> > > [ 2.895772] v4l_streamon+0x20/0x30
> > > [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> > > [ 2.895781] video_usercopy+0x27e/0x7e0
> > > [ 2.895785] video_ioctl2+0x15/0x20
> > > [ 2.895789] v4l2_ioctl+0x49/0x50
> > > [ 2.895794] do_video_ioctl+0x93c/0x2360
> > > [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> > > [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > > [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> > > [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> > > [ 2.895821]
> > > [ 2.895821] other info that might help us debug this:
> > > [ 2.895821]
> > > [ 2.895826] Possible unsafe locking scenario:
> > > [ 2.895826]
> > > [ 2.895830] CPU0 CPU1
> > > [ 2.895833] ---- ----
> > > [ 2.895836] lock(&mdev->graph_mutex);
> > > [ 2.895842] lock(&asd->mutex);
> > > [ 2.895847] lock(&mdev->graph_mutex);
> > > [ 2.895852] lock(&asd->mutex);
> > > [ 2.895857]
> > > [ 2.895857] *** DEADLOCK ***
> > > [ 2.895857]
> > > [ 2.895863] 3 locks held by earlyEvs/1839:
> > > [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> > > [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> > > [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> > > [ 2.895896]
> > > [ 2.895896] stack backtrace:
> > > [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> > > [ 2.895907] Call Trace:
> > > [ 2.895915] dump_stack+0x70/0xa5
> > > [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> > > [ 2.895927] __lock_acquire+0x1284/0x1340
> > > [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> > > [ 2.895940] lock_acquire+0x95/0x1a0
> > > [ 2.895945] ? lock_acquire+0x95/0x1a0
> > > [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895961] __mutex_lock+0x80/0x9a0
> > > [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895971] ? crlmodule_get_format+0x43/0x50
> > > [ 2.895979] mutex_lock_nested+0x1b/0x20
> > > [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> > > [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> > > [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> > > [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> > > [ 2.896018] ? find_held_lock+0x35/0xa0
> > > [ 2.896023] ? crlmodule_get_format+0x43/0x50
> > > [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> > > [ 2.896042] ? mutex_unlock+0x12/0x20
> > > [ 2.896046] ? crlmodule_get_format+0x43/0x50
> > > [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> > > [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> > > [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> > > [ 2.896070] ? kernel_text_address+0xc4/0x100
> > > [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> > > [ 2.896085] ? kernel_text_address+0xc4/0x100
> > > [ 2.896092] ? __lock_acquire+0x1106/0x1340
> > > [ 2.896096] ? __lock_acquire+0x1169/0x1340
> > > [ 2.896103] csi2_link_validate+0xc6/0x220
> > > [ 2.896110] ? __lock_is_held+0x5a/0xa0
> > > [ 2.896115] ? mark_held_locks+0x58/0x80
> > > [ 2.896122] ? __kmalloc+0x207/0x2e0
> > > [ 2.896127] ? __lock_is_held+0x5a/0xa0
> > > [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> > > [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> > > [ 2.896144] ? media_pipeline_start+0x28/0x50
> > > [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> > > [ 2.896155] ? csi2_has_route+0x18/0x20
> > > [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> > > [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> > > [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> > > [ 2.896179] media_pipeline_start+0x33/0x50
> > > [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> > > [ 2.896191] ? __lock_acquire+0x132e/0x1340
> > > [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> > > [ 2.896204] ? lock_acquire+0x95/0x1a0
> > > [ 2.896209] ? start_streaming+0x5c/0x3a0
> > > [ 2.896215] ? start_streaming+0x5c/0x3a0
> > > [ 2.896221] ? __mutex_lock+0x391/0x9a0
> > > [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> > > [ 2.896233] ? find_held_lock+0x35/0xa0
> > > [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> > > [ 2.896245] start_streaming+0x186/0x3a0
> > > [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> > > [ 2.896257] vb2_start_streaming+0x6d/0x130
> > > [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> > > [ 2.896267] vb2_core_streamon+0x108/0x140
> > > [ 2.896273] vb2_streamon+0x29/0x50
> > > [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> > > [ 2.896284] v4l_streamon+0x20/0x30
> > > [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> > > [ 2.896296] ? __might_fault+0x85/0x90
> > > [ 2.896302] video_usercopy+0x27e/0x7e0
> > > [ 2.896307] ? copy_overflow+0x20/0x20
> > > [ 2.896313] ? find_held_lock+0x35/0xa0
> > > [ 2.896319] ? __might_fault+0x3e/0x90
> > > [ 2.896325] video_ioctl2+0x15/0x20
> > > [ 2.896330] v4l2_ioctl+0x49/0x50
> > > [ 2.896335] do_video_ioctl+0x93c/0x2360
> > > [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> > > [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> > > [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> > > [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> > > [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> > > [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> > > [ 2.896377] RIP: 0023:0xf7e79b79
> > > [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> > > [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> > > [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> > > [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> > > [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> > > [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > >
> > > ----------------------------------------------------------------------
> > >
> > > > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > > > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > > > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> > > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Thursday, December 13, 2018 12:40 PM
> > > > To: Zhang, Jun <[email protected]>
> > > > Cc: He, Bo <[email protected]>; Steven Rostedt
> > > > <[email protected]>; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > > Yanmin <[email protected]>; Bai, Jie A
> > > > <[email protected]>; Sun, Yi J <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > > > Ok, we will test it, thanks!
> > > >
> > > > But please also try the sysrq-y with the earlier patch after a hang!
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > > Sent: Thursday, December 13, 2018 10:43
> > > > > To: Zhang, Jun <[email protected]>
> > > > > Cc: He, Bo <[email protected]>; Steven Rostedt
> > > > > <[email protected]>; [email protected];
> > > > > [email protected]; [email protected];
> > > > > [email protected]; Xiao, Jin <[email protected]>; Zhang,
> > > > > Yanmin <[email protected]>; Bai, Jie A
> > > > > <[email protected]>; Sun, Yi J <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > > > Hello, Paul
> > > > > >
> > > > > > I think the next patch is better.
> > > > > > Because ULONG_CMP_GE could cause double write, which has risk that write back old value.
> > > > > > Please help review.
> > > > > > I don't test it. If you agree, we will test it.
> > > > >
> > > > > Just to make sure that I understand, you are worried about something like the following, correct?
> > > > >
> > > > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > > > and finds them equal.
> > > > >
> > > > > o At just this time something like rcu_start_this_gp() assigns a new
> > > > > (larger) value to rdp->gp_seq_needed.
> > > > >
> > > > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > > > old value.
> > > > >
> > > > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > > > >
> > > > > Of course, there could be bugs. For example:
> > > > >
> > > > > o __note_gp_changes() might be called on a different CPU than that
> > > > > corresponding to rdp. You can check this with something like:
> > > > >
> > > > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > > > >
> > > > > o The same things could happen with rcu_start_this_gp(), and the
> > > > > above WARN_ON_ONCE() would work there as well.
> > > > >
> > > > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > > > >
> > > > > o Interrupts might not really be disabled.
> > > > >
> > > > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > > > >
> > > > > So why not just test it anyway? If it makes the bug go away,
> > > > > I will be surprised, but it would not be the first surprise for me.
> > > > > ;-)
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > Thanks!
> > > > > >
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 0b760c1..c00f34e 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > > >  		zero_cpu_stall_ticks(rdp);
> > > > > >  	}
> > > > > >  	rdp->gp_seq = rnp->gp_seq;  /* Remember new grace-period state. */
> > > > > > -	if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > > > +	if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > > > >  		rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > > >  	WRITE_ONCE(rdp->gpwrap, false);
> > > > > >  	rcu_gpnum_ovf(rnp, rdp);
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney [mailto:[email protected]]
> > > > > > Sent: Thursday, December 13, 2018 08:12
> > > > > > To: He, Bo <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]; [email protected];
> > > > > > Zhang, Jun <[email protected]>; Xiao, Jin
> > > > > > <[email protected]>; Zhang, Yanmin
> > > > > > <[email protected]>; Bai, Jie A <[email protected]>;
> > > > > > Sun, Yi J <[email protected]>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6 and do not see sysrq_rcu there either.
> > > > > > > Please correct me if I have something wrong.
> > > > > >
> > > > > > That would be because I sent you the wrong patch, apologies!
> > > > > > :-/
> > > > > >
> > > > > > Please instead see the one below, which does add sysrq_rcu.
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Paul E. McKenney <[email protected]>
> > > > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > > > To: He, Bo <[email protected]>
> > > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > > [email protected]; [email protected];
> > > > > > > [email protected]; [email protected];
> > > > > > > Zhang, Jun <[email protected]>; Xiao, Jin
> > > > > > > <[email protected]>; Zhang, Yanmin
> > > > > > > <[email protected]>; Bai, Jie A <[email protected]>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > > We reproduced it on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.
> > > > > > > >
> > > > > > > > Strange.
> > > > > > > >
> > > > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > > > >
> > > > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > > >
> > > > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > > > Author: Paul E. McKenney <[email protected]>
> > > > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > > > >
> > > > > > rcu: Add sysrq rcu_node-dump capability
> > > > > >
> > > > > > Backported from v4.21/v5.0
> > > > > >
> > > > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > > > for failing to start a grace period. This commit therefore adds a
> > > > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > > > must be set for this sysrq to be available.
> > > > > >
> > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -61,6 +61,7 @@
> > > > > >  #include <linux/trace_events.h>
> > > > > >  #include <linux/suspend.h>
> > > > > >  #include <linux/ftrace.h>
> > > > > > +#include <linux/sysrq.h>
> > > > > > 
> > > > > >  #include "tree.h"
> > > > > >  #include "rcu.h"
> > > > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > > >  /* panic() on RCU Stall sysctl. */
> > > > > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > > > +static bool sysrq_rcu;
> > > > > > +module_param(sysrq_rcu, bool, 0444);
> > > > > > 
> > > > > >  /*
> > > > > >   * The rcu_scheduler_active variable is initialized to the value
> > > > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > > > > 
> > > > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > > > +static void sysrq_show_rcu(int key)
> > > > > > +{
> > > > > > +	show_rcu_gp_kthreads();
> > > > > > +}
> > > > > > +
> > > > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > > > +	.handler = sysrq_show_rcu,
> > > > > > +	.help_msg = "show-rcu(y)",
> > > > > > +	.action_msg = "Show RCU tree",
> > > > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > > > +};
> > > > > > +
> > > > > > +static int __init rcu_sysrq_init(void)
> > > > > > +{
> > > > > > +	if (sysrq_rcu)
> > > > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +early_initcall(rcu_sysrq_init);
> > > > > > +
> > > > > >  /*
> > > > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > > > >   */
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>


Attachments:
0001-rcu-detect-the-preempt_rcu-hang-for-triage-jing-s-bo.patch (1.65 kB)
0002-rcu-v2-detect-the-preempt_rcu-hang-for-triage-jing-s.patch (1.04 kB)

2018-12-17 05:01:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Mon, Dec 17, 2018 at 03:15:42AM +0000, He, Bo wrote:
> To double-confirm, since the issue had not reproduced after 90 hours, we tried adding only the enclosed patch on the easily-reproduced build; the issue was then not reproduced after 63 hours over the whole weekend on 16 boards.
> So the current conclusion is that the debug patch strongly suppresses the rcu issue.

This is not a surprise. (Please see the end of this email for a
replacement patch that won't suppress the bug.)

To see why this is not a surprise, let's take a closer look at your patch,
in light of the comment header for swait_event_idle_timeout_exclusive():

* Returns:
* 0 if the @condition evaluated to %false after the @timeout elapsed,
* 1 if the @condition evaluated to %true after the @timeout elapsed,
* or the remaining jiffies (at least 1) if the @condition evaluated
* to %true before the @timeout elapsed.

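Schematically, those three cases can be told apart as follows (an
illustrative sketch only, not kernel code; "wq", "condition", and
"timeout" are placeholders):

	long ret = swait_event_idle_timeout_exclusive(wq, condition, timeout);

	if (ret == 0) {
		/* Condition still false: we simply timed out. */
	} else if (ret == 1) {
		/* Condition became true, but was noticed only at timeout
		 * expiry -- the lost-wakeup case at issue here. */
	} else {
		/* ret > 1: condition became true with ret jiffies to spare,
		 * that is, a normal, timely wakeup. */
	}
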
The situation we are seeing is that the RCU_GP_FLAG_INIT is set, but
the rcu_preempt task does not wake up. This would correspond to
the second case above, that is, a return value of 1. Looking now
at your patch, with comments interspersed below:

------------------------------------------------------------------------

From e8b583aa685b3b4f304f72398a80461bba09389c Mon Sep 17 00:00:00 2001
From: "he, bo" <[email protected]>
Date: Sun, 9 Dec 2018 18:11:33 +0800
Subject: [PATCH] rcu: detect the preempt_rcu hang for triage jing's board

Change-Id: I2ffceec2ae4847867753609e45c99afc66956003
Tracked-On:
Signed-off-by: he, bo <[email protected]>
---
kernel/rcu/tree.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78c0cf2..d6de363 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2192,8 +2192,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	pid_t rcu_preempt_pid;
 
 	rcu_bind_gp_kthread();
+	if(!strcmp(rsp->name, "rcu_preempt")) {
+		rcu_preempt_pid = rsp->gp_kthread->pid;
+	}
+
 	for (;;) {
 
 		/* Handle grace-period start. */
@@ -2202,8 +2207,19 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gp_seq),
 					       TPS("reqwait"));
 		rsp->gp_state = RCU_GP_WAIT_GPS;
-		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-					   RCU_GP_FLAG_INIT);
+		if (current->pid != rcu_preempt_pid) {
+			swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+						   RCU_GP_FLAG_INIT);
+		} else {
+			ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+								 RCU_GP_FLAG_INIT, 2*HZ);
+
+			if(!ret) {

We get here if ret==0. Therefore, the above "if" statement needs to
instead be "if (ret == 1) {".

In addition, in order to get event traces dumped, we also need:

rcu_ftrace_dump(DUMP_ALL);

+				show_rcu_gp_kthreads();
+				panic("hung_task: blocked in rcu_gp_kthread init");
+			}
+		}
+
 		rsp->gp_state = RCU_GP_DONE_GPS;
 		/* Locking provides needed memory barrier. */
 		if (rcu_gp_init(rsp))
--
2.7.4

------------------------------------------------------------------------

So, again, please change the "if(!ret) {" to "if (ret == 1) {", and
please add "rcu_ftrace_dump(DUMP_ALL);" right after this "if" statement,
as shown above.

With that change, I bet that you will again see failures.

> By comparison, swait_event_idle_timeout_exclusive() checks the condition three times, while swait_event_idle_exclusive() checks it only twice.
>
> so today I will do another experiment, only change as below:
> - swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> - RCU_GP_FLAG_INIT);
> + ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> + RCU_GP_FLAG_INIT, MAX_SCHEDULE_TIMEOUT);
> +
>
> Can you get some clues from the experiment?

Again, please instead make the changes that I called out above, with
the replacement for your patch 0001 shown below.

Thanx, Paul

PS. I have been testing for quite some time, but am still unable
to reproduce this. So we must depend on you to reproduce it.

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..86152af1a580 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2153,8 +2153,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	pid_t rcu_preempt_pid;
 
 	rcu_bind_gp_kthread();
+	if(!strcmp(rsp->name, "rcu_preempt")) {
+		rcu_preempt_pid = rsp->gp_kthread->pid;
+	}
+
 	for (;;) {
 
 		/* Handle grace-period start. */
@@ -2163,8 +2168,20 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gp_seq),
 					       TPS("reqwait"));
 		rsp->gp_state = RCU_GP_WAIT_GPS;
-		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-					   RCU_GP_FLAG_INIT);
+		if (current->pid != rcu_preempt_pid) {
+			swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+						   RCU_GP_FLAG_INIT);
+		} else {
+			ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+								 RCU_GP_FLAG_INIT, 2*HZ);
+
+			if (ret == 1) {
+				rcu_ftrace_dump(DUMP_ALL);
+				show_rcu_gp_kthreads();
+				panic("hung_task: blocked in rcu_gp_kthread init");
+			}
+		}
+
 		rsp->gp_state = RCU_GP_DONE_GPS;
 		/* Locking provides needed memory barrier. */
 		if (rcu_gp_init(rsp))


2018-12-18 02:49:06

by Zhang, Jun

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

Hello, Paul

In softirq context, when current is rcu_preempt-10, rcu_gp_kthread_wake() does not wake up rcu_preempt.
The following patch might fix it. Please help review.

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1..98f5b40 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1697,7 +1697,7 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
  */
 static void rcu_gp_kthread_wake(struct rcu_state *rsp)
 {
-	if (current == rsp->gp_kthread ||
+	if (((current == rsp->gp_kthread) && !in_softirq()) ||
 	    !READ_ONCE(rsp->gp_flags) ||
 	    !rsp->gp_kthread)
 		return;

[44932.311439, 0][ rcu_preempt] rcu_preempt-10 [001] .n.. 44929.401037: rcu_grace_period: rcu_preempt 19063548 reqwait
......
[44932.311517, 0][ rcu_preempt] rcu_preempt-10 [001] d.s2 44929.402234: rcu_future_grace_period: rcu_preempt 19063548 19063552 0 0 3 Startleaf
[44932.311536, 0][ rcu_preempt] rcu_preempt-10 [001] d.s2 44929.402237: rcu_future_grace_period: rcu_preempt 19063548 19063552 0 0 3 Startedroot


-----Original Message-----
From: He, Bo
Sent: Tuesday, December 18, 2018 07:16
To: [email protected]
Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>; Chang, Junxiao <[email protected]>; Mei, Paul <[email protected]>
Subject: RE: rcu_preempt caused oom

Thanks for your comments. With the change to "if (ret == 1)", the issue does hit the panic. Here enclosed are the logs.

-----Original Message-----
From: Paul E. McKenney <[email protected]>
Sent: Monday, December 17, 2018 12:26 PM
To: He, Bo <[email protected]>
Cc: Zhang, Jun <[email protected]>; Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>; Chang, Junxiao <[email protected]>; Mei, Paul <[email protected]>
Subject: Re: rcu_preempt caused oom

On Mon, Dec 17, 2018 at 03:15:42AM +0000, He, Bo wrote:
> for double confirm the issue is not reproduce after 90 hours, we tried only add the enclosed patch on the easy reproduced build, the issue is not reproduced after 63 hours in the whole weekend on 16 boards.
> so current conclusion is the debug patch has extreme effect on the rcu issue.

This is not a surprise. (Please see the end of this email for a replacement patch that won't suppress the bug.)

To see why this is not a surprise, let's take a closer look at your patch, in light of the comment header for wait_event_idle_timeout_exclusive():

* Returns:
* 0 if the @condition evaluated to %false after the @timeout elapsed,
* 1 if the @condition evaluated to %true after the @timeout elapsed,
* or the remaining jiffies (at least 1) if the @condition evaluated
* to %true before the @timeout elapsed.

The situation we are seeing is that the RCU_GP_FLAG_INIT is set, but the rcu_preempt task does not wake up. This would correspond to the second case above, that is, a return value of 1. Looking now at your patch, with comments interspersed below:

------------------------------------------------------------------------

From e8b583aa685b3b4f304f72398a80461bba09389c Mon Sep 17 00:00:00 2001
From: "he, bo" <[email protected]>
Date: Sun, 9 Dec 2018 18:11:33 +0800
Subject: [PATCH] rcu: detect the preempt_rcu hang for triage jing's board

Change-Id: I2ffceec2ae4847867753609e45c99afc66956003
Tracked-On:
Signed-off-by: he, bo <[email protected]>
---
kernel/rcu/tree.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 78c0cf2..d6de363 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2192,8 +2192,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
int ret;
struct rcu_state *rsp = arg;
struct rcu_node *rnp = rcu_get_root(rsp);
+ pid_t rcu_preempt_pid;

rcu_bind_gp_kthread();
+ if(!strcmp(rsp->name, "rcu_preempt")) {
+ rcu_preempt_pid = rsp->gp_kthread->pid;
+ }
+
for (;;) {

/* Handle grace-period start. */
@@ -2202,8 +2207,19 @@ static int __noreturn rcu_gp_kthread(void *arg)
READ_ONCE(rsp->gp_seq),
TPS("reqwait"));
rsp->gp_state = RCU_GP_WAIT_GPS;
- swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
- RCU_GP_FLAG_INIT);
+ if (current->pid != rcu_preempt_pid) {
+ swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+ RCU_GP_FLAG_INIT);
+ } else {
+ ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+ RCU_GP_FLAG_INIT, 2*HZ);
+
+ if(!ret) {

We get here if ret==0. Therefore, the above "if" statement needs to instead be "if (ret == 1) {".

In addition, in order to get event traces dumped, we also need:

rcu_ftrace_dump(DUMP_ALL);

+ show_rcu_gp_kthreads();
+ panic("hung_task: blocked in rcu_gp_kthread init");
+ }
+ }
+
rsp->gp_state = RCU_GP_DONE_GPS;
/* Locking provides needed memory barrier. */
if (rcu_gp_init(rsp))
--
2.7.4

------------------------------------------------------------------------

So, again, please change the "if(!ret) {" to "if (ret == 1) {", and please add "rcu_ftrace_dump(DUMP_ALL);" right after this "if" statement, as shown above.

With that change, I bet that you will again see failures.

> Compared with the swait_event_idle_timeout_exclusive will do 3 times to check the condition, while swait_event_idle_ exclusive will do 2 times check the condition.
>
> so today I will do another experiment, only change as below:
> - swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> - RCU_GP_FLAG_INIT);
> + ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> + RCU_GP_FLAG_INIT, MAX_SCHEDULE_TIMEOUT);
> +
>
> Can you get some clues from the experiment?

Again, please instead make the changes that I called out above, with the replacement for your patch 0001 shown below.

Thanx, Paul

PS. I have been testing for quite some time, but am still unable
to reproduce this. So we must depend on you to reproduce it.

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0b760c1369f7..86152af1a580 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2153,8 +2153,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
int ret;
struct rcu_state *rsp = arg;
struct rcu_node *rnp = rcu_get_root(rsp);
+ pid_t rcu_preempt_pid;

rcu_bind_gp_kthread();
+ if(!strcmp(rsp->name, "rcu_preempt")) {
+ rcu_preempt_pid = rsp->gp_kthread->pid;
+ }
+
for (;;) {

/* Handle grace-period start. */
@@ -2163,8 +2168,20 @@ static int __noreturn rcu_gp_kthread(void *arg)
READ_ONCE(rsp->gp_seq),
TPS("reqwait"));
rsp->gp_state = RCU_GP_WAIT_GPS;
- swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
- RCU_GP_FLAG_INIT);
+ if (current->pid != rcu_preempt_pid) {
+ swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+ RCU_GP_FLAG_INIT);
+ } else {
+ ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+ RCU_GP_FLAG_INIT, 2*HZ);
+
+ if (ret == 1) {
+ rcu_ftrace_dump(DUMP_ALL);
+ show_rcu_gp_kthreads();
+ panic("hung_task: blocked in rcu_gp_kthread init");
+ }
+ }
+
rsp->gp_state = RCU_GP_DONE_GPS;
/* Locking provides needed memory barrier. */
if (rcu_gp_init(rsp))


2018-12-18 03:13:53

by He, Bo

[permalink] [raw]
Subject: RE: rcu_preempt caused oom

I checked with Jun; the scenario is more like this:

@@@ rcu_start_this_gp() runs after ___swait_event() has checked the condition, but before schedule() @@@
rcu_gp_kthread --> swait_event_idle_exclusive --> __swait_event_idle --> ___swait_event ---------> schedule
@@@ rcu_gp_kthread_wake() then skips the wakeup, because it is called from within rcu_gp_kthread @@@

so rcu_gp_kthread will sleep and can never be woken up.

Jun's patch can work around it. What are your ideas?


-----Original Message-----
From: Zhang, Jun
Sent: Tuesday, December 18, 2018 10:47 AM
To: He, Bo <[email protected]>; [email protected]
Cc: Steven Rostedt <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Xiao, Jin <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A <[email protected]>; Sun, Yi J <[email protected]>; Chang, Junxiao <[email protected]>; Mei, Paul <[email protected]>
Subject: RE: rcu_preempt caused oom



2018-12-18 05:36:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_preempt caused oom

On Tue, Dec 18, 2018 at 02:46:43AM +0000, Zhang, Jun wrote:
> Hello, paul
>
> In softirq context, and current is rcu_preempt-10, rcu_gp_kthread_wake don't wakeup rcu_preempt.
> Maybe next patch could fix it. Please help review.
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0b760c1..98f5b40 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1697,7 +1697,7 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> */
> static void rcu_gp_kthread_wake(struct rcu_state *rsp)
> {
> - if (current == rsp->gp_kthread ||
> + if (((current == rsp->gp_kthread) && !in_softirq()) ||

Close, but not quite. Please see below.

> !READ_ONCE(rsp->gp_flags) ||
> !rsp->gp_kthread)
> return;
>
> [44932.311439, 0][ rcu_preempt] rcu_preempt-10 [001] .n.. 44929.401037: rcu_grace_period: rcu_preempt 19063548 reqwait
> ......
> [44932.311517, 0][ rcu_preempt] rcu_preempt-10 [001] d.s2 44929.402234: rcu_future_grace_period: rcu_preempt 19063548 19063552 0 0 3 Startleaf
> [44932.311536, 0][ rcu_preempt] rcu_preempt-10 [001] d.s2 44929.402237: rcu_future_grace_period: rcu_preempt 19063548 19063552 0 0 3 Startedroot

Good catch! If the rcu_preempt kthread had just entered the function
swait_event_idle_exclusive(), which had just called __swait_event_idle()
which had just called ___swait_event(), which had just gotten done
checking the "condition", then yes, the rcu_preempt kthread could
sleep forever. This is a very narrow race window, but that matches
your experience with its not happening often -- and my experience with
it not happening at all.

However, for this to happen, the wakeup must happen within a softirq
handler that executes upon return from an interrupt that interrupted
___swait_event() just after the "if (condition)". For this, we don't want
in_softirq() but rather in_serving_softirq(), as shown in the patch below.
The patch you have above could result in spurious wakeups, as it is
checking for bottom halves being disabled, not just executing within a
softirq handler. Which might be better than not having enough wakeups,
but let's please try for just the right number. ;-)

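To make the window concrete, here is a simplified sketch of the wait
loop (modeled loosely on ___swait_event(); illustrative only, not the
verbatim kernel code):

	DECLARE_SWAITQUEUE(wait);

	for (;;) {
		prepare_to_swait_exclusive(&rsp->gp_wq, &wait, TASK_IDLE);
		if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT)
			break;	/* Final pre-sleep check of the condition. */
		/*
		 * <-- An interrupt arriving here can run a softirq handler
		 *     that sets RCU_GP_FLAG_INIT and calls
		 *     rcu_gp_kthread_wake().  The old "current ==
		 *     rsp->gp_kthread" test drops that wakeup, so the
		 *     schedule() below can then sleep forever.
		 */
		schedule();
	}
	finish_swait(&rsp->gp_wq, &wait);
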
So could you please instead test the patch below?

And if it works, could I please have your Signed-off-by so that I can
queue it? My patch is quite clearly derived from yours, after all!
And you should get credit for finding the problem and arriving at an
approximate fix, after all.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e9392a9d6291..b9205b40b621 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1722,7 +1722,7 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
  */
 static void rcu_gp_kthread_wake(struct rcu_state *rsp)
 {
-	if (current == rsp->gp_kthread ||
+	if ((current == rsp->gp_kthread && !in_serving_softirq()) ||
 	    !READ_ONCE(rsp->gp_flags) ||
 	    !rsp->gp_kthread)
 		return;


Subject: [tip:core/rcu] rcu: Do RCU GP kthread self-wakeup from softirq and interrupt

Commit-ID: 1d1f898df6586c5ea9aeaf349f13089c6fa37903
Gitweb: https://git.kernel.org/tip/1d1f898df6586c5ea9aeaf349f13089c6fa37903
Author: Zhang, Jun <[email protected]>
AuthorDate: Tue, 18 Dec 2018 06:55:01 -0800
Committer: Paul E. McKenney <[email protected]>
CommitDate: Fri, 25 Jan 2019 15:29:59 -0800

rcu: Do RCU GP kthread self-wakeup from softirq and interrupt

The rcu_gp_kthread_wake() function is invoked when it might be necessary
to wake the RCU grace-period kthread. Because self-wakeups are normally
a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
this kthread, it naturally refuses to do the wakeup.

Unfortunately, natural though it might be, this heuristic fails when
rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
that interrupted the grace-period kthread just after the final check of
the wait-event condition but just before the schedule() call. In this
case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
is within the RCU grace-period kthread's context. Failing to provide
this wakeup can result in grace periods failing to start, which in turn
results in out-of-memory conditions.

This race window is quite narrow, but it actually did happen during real
testing. It would of course need to be fixed even if it was strictly
theoretical in nature.

This patch does not Cc stable because it does not apply cleanly to
earlier kernel versions.

Fixes: 48a7639ce80c ("rcu: Make callers awaken grace-period kthread")
Reported-by: "He, Bo" <[email protected]>
Co-developed-by: "Zhang, Jun" <[email protected]>
Co-developed-by: "He, Bo" <[email protected]>
Co-developed-by: "xiao, jin" <[email protected]>
Co-developed-by: Bai, Jie A <[email protected]>
Signed-off: "Zhang, Jun" <[email protected]>
Signed-off: "He, Bo" <[email protected]>
Signed-off: "xiao, jin" <[email protected]>
Signed-off: Bai, Jie A <[email protected]>
Signed-off-by: "Zhang, Jun" <[email protected]>
[ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
!in_serving_softirq()" to avoid redundant wakeups and to also handle the
interrupt-handler scenario as well as the softirq-handler scenario that
actually occurred in testing. ]
Signed-off-by: Paul E. McKenney <[email protected]>
Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
---
kernel/rcu/tree.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ceb93f848cd..21775eebb8f0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1593,15 +1593,23 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
 }
 
 /*
- * Awaken the grace-period kthread.  Don't do a self-awaken, and don't
- * bother awakening when there is nothing for the grace-period kthread
- * to do (as in several CPUs raced to awaken, and we lost), and finally
- * don't try to awaken a kthread that has not yet been created.  If
- * all those checks are passed, track some debug information and awaken.
+ * Awaken the grace-period kthread.  Don't do a self-awaken (unless in
+ * an interrupt or softirq handler), and don't bother awakening when there
+ * is nothing for the grace-period kthread to do (as in several CPUs raced
+ * to awaken, and we lost), and finally don't try to awaken a kthread that
+ * has not yet been created.  If all those checks are passed, track some
+ * debug information and awaken.
+ *
+ * So why do the self-wakeup when in an interrupt or softirq handler
+ * in the grace-period kthread's context?  Because the kthread might have
+ * been interrupted just as it was going to sleep, and just after the final
+ * pre-sleep check of the awaken condition.  In this case, a wakeup really
+ * is required, and is therefore supplied.
  */
 static void rcu_gp_kthread_wake(void)
 {
-	if (current == rcu_state.gp_kthread ||
+	if ((current == rcu_state.gp_kthread &&
+	     !in_interrupt() && !in_serving_softirq()) ||
 	    !READ_ONCE(rcu_state.gp_flags) ||
 	    !rcu_state.gp_kthread)
 		return;

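As a side note on the predicates chosen above, roughly paraphrasing include/linux/preempt.h (this summary is mine, not the kernel's actual code):

	/*
	 * in_serving_softirq(): true only while a softirq handler is
	 *                       actually executing.
	 * in_softirq():         also true when bottom halves are merely
	 *                       disabled (e.g. under spin_lock_bh()), which
	 *                       is why the earlier !in_softirq() proposal
	 *                       could have produced spurious self-wakeups.
	 * in_interrupt():       true in hardirq, softirq, and NMI context;
	 *                       checking it as well covers wakeups from
	 *                       interrupt handlers.
	 */
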
Subject: [tip:core/rcu] rcu: Prevent needless ->gp_seq_needed update in __note_gp_changes()

Commit-ID: 13dc7d0c7a2ed438f0ec8e9fb365a1256d87cf87
Gitweb: https://git.kernel.org/tip/13dc7d0c7a2ed438f0ec8e9fb365a1256d87cf87
Author: Zhang, Jun <[email protected]>
AuthorDate: Wed, 19 Dec 2018 10:37:34 -0800
Committer: Paul E. McKenney <[email protected]>
CommitDate: Fri, 25 Jan 2019 15:30:00 -0800

rcu: Prevent needless ->gp_seq_needed update in __note_gp_changes()

Currently, __note_gp_changes() checks to see if the rcu_node structure's
->gp_seq_needed is greater than or equal to that of the rcu_data
structure, and if so, updates the rcu_data structure's ->gp_seq_needed
field. This results in a useless store in the case where the two fields
are equal.

This commit therefore carries out this store only in the case where the
rcu_node structure's ->gp_seq_needed is strictly greater than that of
the rcu_data structure.

Signed-off-by: "Zhang, Jun" <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Link: https://lkml.kernel.org/r/88DC34334CA3444C85D647DBFA962C2735AD5F77@SHSMSX104.ccr.corp.intel.com
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 21775eebb8f0..9d0e2ac9356e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1758,7 +1758,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp)
 		zero_cpu_stall_ticks(rdp);
 	}
 	rdp->gp_seq = rnp->gp_seq;  /* Remember new grace-period state. */
-	if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
+	if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
 		rdp->gp_seq_needed = rnp->gp_seq_needed;
 	WRITE_ONCE(rdp->gpwrap, false);
 	rcu_gpnum_ovf(rnp, rdp);
rcu_gpnum_ovf(rnp, rdp);