Hello,
I observe that running stress-ng with the cpu-online and fstat tests
results in a deadlock of hung tasks:
[ 366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than 120 seconds.
[ 366.817689] Not tainted 4.9.0 #39
[ 366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 366.829320] stress-ng-cpu-o D 0 2590 2589 0x00000008
[ 366.834803] Call trace:
[ 366.837222] [<ffff000008085dd0>] __switch_to+0x60/0x70
[ 366.842338] [<ffff000008a23c18>] __schedule+0x178/0x648
[ 366.847550] [<ffff000008a24120>] schedule+0x38/0x98
[ 366.852408] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
[ 366.858749] [<ffff00000848e9d4>] blk_mq_queue_reinit_work+0x74/0x110
[ 366.865081] [<ffff00000848ea94>] blk_mq_queue_reinit_dead+0x24/0x30
[ 366.871335] [<ffff0000080c9898>] cpuhp_invoke_callback+0x98/0x4a8
[ 366.877411] [<ffff0000080cb084>] cpuhp_down_callbacks+0x114/0x150
[ 366.883484] [<ffff000008a22578>] _cpu_down+0x100/0x1d8
[ 366.888609] [<ffff0000080cbfdc>] do_cpu_down+0x4c/0x78
[ 366.893727] [<ffff0000080cc02c>] cpu_down+0x24/0x30
[ 366.898593] [<ffff0000086aaf28>] cpu_subsys_offline+0x20/0x30
[ 366.904318] [<ffff0000086a53d8>] device_offline+0xa8/0xd8
[ 366.909704] [<ffff0000086a550c>] online_store+0x4c/0xa8
[ 366.914907] [<ffff0000086a241c>] dev_attr_store+0x44/0x60
[ 366.920294] [<ffff0000082b6a24>] sysfs_kf_write+0x5c/0x78
[ 366.925672] [<ffff0000082b5cec>] kernfs_fop_write+0xbc/0x1e8
[ 366.931318] [<ffff000008238320>] __vfs_write+0x48/0x138
[ 366.936526] [<ffff000008239078>] vfs_write+0xa8/0x1c0
[ 366.941557] [<ffff00000823a08c>] SyS_write+0x54/0xb0
[ 366.946511] [<ffff000008083370>] el0_svc_naked+0x24/0x28
[ 366.951800] INFO: task stress-ng-fstat:2591 blocked for more than 120 seconds.
[ 366.959008] Not tainted 4.9.0 #39
[ 366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 366.970640] stress-ng-fstat D 0 2591 2589 0x00000000
[ 366.976105] Call trace:
[ 366.978540] [<ffff000008085dd0>] __switch_to+0x60/0x70
[ 366.983658] [<ffff000008a23c18>] __schedule+0x178/0x648
[ 366.988870] [<ffff000008a24120>] schedule+0x38/0x98
[ 366.993727] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
[ 367.000068] [<ffff00000848e2d0>] blk_mq_freeze_queue+0x28/0x38
[ 367.005880] [<ffff0000086d480c>] lo_release+0x64/0x90
[ 367.010919] [<ffff000008278bd4>] __blkdev_put+0x26c/0x2c8
[ 367.016300] [<ffff000008278fec>] blkdev_put+0x54/0x128
[ 367.021418] [<ffff0000082790ec>] blkdev_close+0x2c/0x40
[ 367.026631] [<ffff00000823ab58>] __fput+0xa0/0x1e0
[ 367.031401] [<ffff00000823ad10>] ____fput+0x20/0x30
[ 367.036266] [<ffff0000080e7a40>] task_work_run+0xc8/0xe8
[ 367.041557] [<ffff0000080882b4>] do_notify_resume+0xac/0xb8
[ 367.047116] [<ffff000008083294>] work_pending+0x8/0x10
I have tested and found this issue to be reproducible on both x86 and
ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3 kernels.
Using the below test methodology [1], the issue reproduces within a few
minutes.
Using ftrace, I have analyzed the issue on 4.9 and I believe I've found
the root cause [2].
Based on my analysis, I have developed a fix [3] which addresses the
issue: I am able to run stress-ng for over an hour, where I was unable
to do so before. However, I do not know the full extent of the impacts
of this fix, and I am looking for guidance from the community to
determine the final fix.
[1] Test methodology
--------------------
Boot a multicore system such as a desktop i5 system with nr_cpus=2
Enable all logging to determine when the deadlock occurs (prints from
the test stop flowing out of the serial port)
echo 1 > /sys/module/printk/parameters/ignore_loglevel
Run stress-ng
stress-ng --fstat 1 --cpu-online 1 -t 3600
Wait for the test output to stop, and the hung task watchdog to fire.
[2] Analysis
------------
Again, this analysis is based upon the 4.9 kernel, but I believe it
still applies to newer kernels.
I conclude that the hung tasks occur due to a race condition which
results in a deadlock.
The race condition occurs between "normal" work in the block layer on a
core (the stress-ng-fstat task in the above dump) and cpu offline of
that core (the stress-ng-cpu-o task in the above dump).
The fput() from userspace in the fstat task results in a call to
blk_mq_freeze_queue(), which drops the last reference to the queue via
percpu_ref_kill(), and then waits for the ref count of the queue to hit
0 in blk_mq_freeze_queue_wait(). percpu_ref_kill() will result in
__percpu_ref_switch_to_atomic(), which will use call_rcu_sched() to set
up delayed work to finalize the percpu_ref cleanup and drop the ref
count to 0.
Note that call_rcu_sched() queues the work to a per-cpu queue; thus the
work can only be run on the core it is queued on, by the work thread
that is pinned to that core.
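
To make this concrete, here is a minimal userspace model of the pattern
described above (not kernel code; the names are made up, and only the
shape of percpu_ref_kill()/blk_mq_freeze_queue_wait() is being modeled):
the "kill" merely schedules the final decrement as a deferred callback,
and the freezing task then blocks until a worker actually runs that
callback. If the worker that owns the callback never runs, the freezer
waits forever.

/*
 * Userspace sketch (hypothetical names, compile with -pthread).
 * ref_kill() defers the final decrement to a callback; freeze_wait()
 * blocks until that callback has run and the count has reached 0.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int refcount = 1;		/* models the queue usage count      */
static int callback_pending;		/* models the call_rcu_sched() entry */

static void ref_kill(void)		/* models percpu_ref_kill()          */
{
	pthread_mutex_lock(&lock);
	callback_pending = 1;		/* the final decrement is deferred   */
	pthread_mutex_unlock(&lock);
}

static void freeze_wait(void)		/* models blk_mq_freeze_queue_wait() */
{
	pthread_mutex_lock(&lock);
	while (refcount != 0)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}

/*
 * Models the pinned per-cpu callback processing.  If this thread never
 * runs (e.g. it was migrated away before processing its list), then
 * freeze_wait() above blocks forever.
 */
static void *callback_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	if (callback_pending) {
		refcount = 0;		/* the deferred final decrement */
		pthread_cond_broadcast(&cond);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	ref_kill();
	pthread_create(&worker, NULL, callback_thread, NULL);
	freeze_wait();			/* returns only after the callback ran */
	pthread_join(worker, NULL);
	printf("queue frozen\n");
	return 0;
}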
It is a race between this work running, and the cpu offline processing.
If the cpu offline processing is able to get to and process the
RCU/tree:online state before the queued work from the block layer has
run, then the pinned work thread will be migrated to another core via
rcutree_offline_cpu(), and the work will not be able to execute.
This race condition does not result in deadlock until later in the cpu
offline processing. Once we hit the block/mq:prepare state, the block
layer freezes all the queues and waits for the ref counts to hit 0.
This normally works because at this point the cpu being offlined is dead
from cpu:teardown, and the offline processing is occurring on another
active cpu, so call_rcu_sched() will queue work to an active cpu where
it can get processed. However, the fstat process already did that work
for one of the queues to be frozen in the block layer, so the processing
of the block/mq:prepare state waits on the same ref count as fstat to
hit 0. Thus we see the result of this as the stress-ng-cpu-o task above.
The block/mq:prepare processing stalls the cpu offline processing,
which causes a deadlock because the processing never reaches the
RCU/tree:prepare state. That state's teardown migrates all of the
queued work from the offline cpu to another cpu, which would allow the
work that the fstat task queued to execute, drop the ref count to 0,
and unblock both stalled tasks.
By reordering the cpu offline states such that the shutdown processing
of RCU/tree:prepare occurs before block/mq:prepare [3], we prevent
deadlock by enabling the queued work in the RCU framework to run
elsewhere, eventually unblocking the tasks waiting on the ref count.
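
To illustrate both the deadlock and why the reordering avoids it, here
is a small single-threaded sketch (again, a model with made-up names;
the state names are the ones from the analysis above, and in the kernel
the block-layer wait never gives up):

/*
 * Toy model of the teardown ordering (not kernel code).
 * callback_runnable stands for the immobilized RCU callback becoming
 * executable again once RCU's teardown has migrated it off the dead cpu.
 */
#include <stdbool.h>
#include <stdio.h>

static bool callback_runnable;
static bool ref_dropped;

static void rcu_tree_teardown(void)	/* models migrating the callbacks */
{
	callback_runnable = true;
}

static bool blk_mq_teardown(void)	/* models the block/mq:prepare wait */
{
	if (callback_runnable)
		ref_dropped = true;	/* the deferred ref drop finally runs */
	return ref_dropped;
}

int main(void)
{
	/* 4.9 order: block/mq:prepare teardown runs before RCU's teardown. */
	if (!blk_mq_teardown())
		printf("deadlock: waiting on a ref only an immobilized callback can drop\n");

	/* Reordered as in [3]: migrate the callbacks first, then freeze. */
	callback_runnable = false;
	ref_dropped = false;
	rcu_tree_teardown();
	if (blk_mq_teardown())
		printf("no deadlock: the callback ran and dropped the ref\n");

	return 0;
}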
However, it is not entirely clear what the full ramifications of this
reorder are. I assume the ordering of these cpu online/offline states
was carefully considered, and without that knowledge, I cannot say for
certain that my fix [3] is safe.
What is the opinion of the domain experts?
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
[3] Proposed fix
---8>---
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index afe641c..9b86db9 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -49,6 +49,7 @@ enum cpuhp_state {
CPUHP_ARM_SHMOBILE_SCU_PREPARE,
CPUHP_SH_SH3X_PREPARE,
CPUHP_BLK_MQ_PREPARE,
+ CPUHP_RCUTREE_PREP2,
CPUHP_TIMERS_DEAD,
CPUHP_NOTF_ERR_INJ_PREPARE,
CPUHP_MIPS_SOC_PREPARE,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 29de1a9..b46c573 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void)
[CPUHP_RCUTREE_PREP] = {
.name = "RCU/tree:prepare",
.startup.single = rcutree_prepare_cpu,
+ .teardown.single = NULL,
+ },
+ [CPUHP_RCUTREE_PREP2] = {
+ .name = "RCU/tree:dead",
+ .startup.single = NULL,
.teardown.single = rcutree_dead_cpu,
},
/*
On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> Hello,
>
> I observe that running stress-ng with the cpu-online and fstat tests
> results in a deadlock of hung tasks:
>
> [ 366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than
> 120 seconds.
> [ 366.817689] Not tainted 4.9.0 #39
> [ 366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 366.829320] stress-ng-cpu-o D 0 2590 2589 0x00000008
> [ 366.834803] Call trace:
> [ 366.837222] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [ 366.842338] [<ffff000008a23c18>] __schedule+0x178/0x648
> [ 366.847550] [<ffff000008a24120>] schedule+0x38/0x98
> [ 366.852408] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 366.858749] [<ffff00000848e9d4>] blk_mq_queue_reinit_work+0x74/0x110
> [ 366.865081] [<ffff00000848ea94>] blk_mq_queue_reinit_dead+0x24/0x30
> [ 366.871335] [<ffff0000080c9898>] cpuhp_invoke_callback+0x98/0x4a8
> [ 366.877411] [<ffff0000080cb084>] cpuhp_down_callbacks+0x114/0x150
> [ 366.883484] [<ffff000008a22578>] _cpu_down+0x100/0x1d8
> [ 366.888609] [<ffff0000080cbfdc>] do_cpu_down+0x4c/0x78
> [ 366.893727] [<ffff0000080cc02c>] cpu_down+0x24/0x30
> [ 366.898593] [<ffff0000086aaf28>] cpu_subsys_offline+0x20/0x30
> [ 366.904318] [<ffff0000086a53d8>] device_offline+0xa8/0xd8
> [ 366.909704] [<ffff0000086a550c>] online_store+0x4c/0xa8
> [ 366.914907] [<ffff0000086a241c>] dev_attr_store+0x44/0x60
> [ 366.920294] [<ffff0000082b6a24>] sysfs_kf_write+0x5c/0x78
> [ 366.925672] [<ffff0000082b5cec>] kernfs_fop_write+0xbc/0x1e8
> [ 366.931318] [<ffff000008238320>] __vfs_write+0x48/0x138
> [ 366.936526] [<ffff000008239078>] vfs_write+0xa8/0x1c0
> [ 366.941557] [<ffff00000823a08c>] SyS_write+0x54/0xb0
> [ 366.946511] [<ffff000008083370>] el0_svc_naked+0x24/0x28
> [ 366.951800] INFO: task stress-ng-fstat:2591 blocked for more than
> 120 seconds.
> [ 366.959008] Not tainted 4.9.0 #39
> [ 366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 366.970640] stress-ng-fstat D 0 2591 2589 0x00000000
> [ 366.976105] Call trace:
> [ 366.978540] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [ 366.983658] [<ffff000008a23c18>] __schedule+0x178/0x648
> [ 366.988870] [<ffff000008a24120>] schedule+0x38/0x98
> [ 366.993727] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 367.000068] [<ffff00000848e2d0>] blk_mq_freeze_queue+0x28/0x38
> [ 367.005880] [<ffff0000086d480c>] lo_release+0x64/0x90
> [ 367.010919] [<ffff000008278bd4>] __blkdev_put+0x26c/0x2c8
> [ 367.016300] [<ffff000008278fec>] blkdev_put+0x54/0x128
> [ 367.021418] [<ffff0000082790ec>] blkdev_close+0x2c/0x40
> [ 367.026631] [<ffff00000823ab58>] __fput+0xa0/0x1e0
> [ 367.031401] [<ffff00000823ad10>] ____fput+0x20/0x30
> [ 367.036266] [<ffff0000080e7a40>] task_work_run+0xc8/0xe8
> [ 367.041557] [<ffff0000080882b4>] do_notify_resume+0xac/0xb8
> [ 367.047116] [<ffff000008083294>] work_pending+0x8/0x10
>
> I have tested and found this issue to be reproducible on both x86
> and ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3
> kernels.
>
> Using the below test methodology [1], the issue reproduces within a
> few minutes.
>
> Using ftrace, I have analyzed the issue on 4.9 and I believe I've
> found the root cause [2].
>
> Based on my analysis, I have developed a fix [3], which addresses
> the issue as I am able to run stress-ng for over an hour where I was
> unable to do so before, however I do not know the full extent of
> impacts from this fix, and look for guidance from the community to
> determine the final fix.
>
>
> [1] Test methodology
> --------------------
> Boot a multicore system such as a desktop i5 system with nr_cpus=2
>
> Enable all logging to determine when the deadlock occurs (prints
> from test stop flowing out of the serial port)
> echo 1 > /sys/module/printk/parameters/ignore_loglevel
>
> Run stress-ng
> stress-ng --fstat 1 --cpu-online 1 -t 3600
>
> Wait for the test output to stop, and the hung task watchdog to fire.
>
>
> [2] Analysis
> ------------
> Again, this analysis is based upon the 4.9 kernel, but I believe it to
> still apply to newer kernels.
>
> I conclude that the hung tasks occur due to a race condition which
> results in a deadlock.
>
> The race condition occurs between "normal" work in the block layer
> on a core (the stress-ng-fstat task in the above dump) and cpu
> offline of that core (the stress-ng-cpu-o task in the above dump).
>
> The fput() from userspace in the fstat task results in a call to
> blk_mq_freeze_queue(), which drops the last reference to the queue
> via percpu_ref_kill(), and then waits for the ref count of the queue
> to hit 0 in blk_mq_freeze_queue_wait(). percpu_ref_kill() will
> result in __percpu_ref_switch_to_atomic() which will use
> call_rcu_sched() to setup delayed work to finalize the percpu_ref
> cleanup and drop the ref count to 0.
>
> Note that call_rcu_sched() queues the work to a per-cpu queue, thus
> the work can only be run on the core it is queued on, by the work
> thread that is pinned to that core.
>
> It is a race between this work running, and the cpu offline processing.
One quick way to test this assumption is to build a kernel with Kconfig
options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
cause call_rcu_sched() to queue the work to a kthread, which can migrate
to some other CPU. If your analysis is correct, this should avoid
the deadlock. (Note that the deadlock should be fixed in any case;
this is just a diagnostic assumption-check procedure.)
> If the cpu offline processing is able to get to and process the
> RCU/tree:online state before the queued work from the block layer,
> then the pinned work thread will be migrated to another core via
> rcutree_offline_cpu(), and the work will not be able to execute.
>
> This race condition does not result in deadlock until later in the
> cpu offline processing. Once we hit the block/mq:prepare state the
> block layer freezes all the queues and waits for the ref counts to
> hit 0. This normally works because at this point the cpu being
> offlined is dead from cpu:teardown, and the offline processing is
> occurring on another active cpu, so call_rcu_sched() will queue work
> to an active cpu where it can get processed. However the fstat
> process already did that work for one of the queues to be frozen in
> the block layer, so the processing of the block/mq:prepare state
> waits on the same ref count as fstat to hit 0. Thus we see the
> result of this as the stress-ng-cpu-o task above.
>
> The block/mq:prepare processing stalls the cpu offline processing
> which causes a deadlock because the processing does not get to the
> RCU/tree:prepare state which migrates all of the queued work from
> the offline cpu to another cpu, which would allow the work that the
> fstat task queued to execute, drop the ref count to 0, and unblock
> both stalled tasks.
>
> By reordering the cpu offline states such that the shutdown processing of
> RCU/tree:prepare occurs before block/mq:prepare [3], we prevent
> deadlock by enabling the queued work in the RCU framework to run
> elsewhere, and eventually unblock the tasks waiting on the ref
> count.
>
> However, it is not entirely clear what the full ramifications of
> this reorder are. I assume the ordering of these cpu online/offline
> states is carefully considered, and without that knowledge, I could
> not say for certain that my fix [3] is safe.
>
> What is the opinion of the domain experts?
I do hope that we can come up with a better fix. No offense intended,
as coming up with -any- fix in the CPU-hotplug domain is not to be
denigrated, but this looks to be at best quite fragile.
Thanx, Paul
> --
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
>
>
> [3] Proposed fix
> ---8>---
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index afe641c..9b86db9 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -49,6 +49,7 @@ enum cpuhp_state {
> CPUHP_ARM_SHMOBILE_SCU_PREPARE,
> CPUHP_SH_SH3X_PREPARE,
> CPUHP_BLK_MQ_PREPARE,
> + CPUHP_RCUTREE_PREP2,
> CPUHP_TIMERS_DEAD,
> CPUHP_NOTF_ERR_INJ_PREPARE,
> CPUHP_MIPS_SOC_PREPARE,
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 29de1a9..b46c573 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void)
> [CPUHP_RCUTREE_PREP] = {
> .name = "RCU/tree:prepare",
> .startup.single = rcutree_prepare_cpu,
> + .teardown.single = NULL,
> + },
> + [CPUHP_RCUTREE_PREP2] = {
> + .name = "RCU/tree:dead",
> + .startup.single = NULL,
> .teardown.single = rcutree_dead_cpu,
> },
> /*
>
Hi Paul.
Thanks for the quick reply.
On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
>> It is a race between this work running, and the cpu offline processing.
>
> One quick way to test this assumption is to build a kernel with Kconfig
> options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
> cause call_rcu_sched() to queue the work to a kthread, which can migrate
> to some other CPU. If your analysis is correct, this should avoid
> the deadlock. (Note that the deadlock should be fixed in any case,
> just a diagnostic assumption-check procedure.)
I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y, and
CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to do
one test run; the issue still reproduced, but it took a fair bit longer
to do so. An initial look at the data indicates that the work is still
not running. An odd observation: the two threads are no longer blocked
on the same queue, but on different ones.
Let me look at this more and see what is going on now.
>> What is the opinion of the domain experts?
>
> I do hope that we can come up with a better fix. No offense intended,
> as coming up with -any- fix in the CPU-hotplug domain is not to be
> denigrated, but this looks to be at best quite fragile.
>
> Thanx, Paul
>
None taken. I'm not particularly attached to the current fix. I agree,
it does appear to be quite fragile.
I'm still not sure what a better solution would be, though. Maybe the
RCU framework could flush the work somehow during cpu offline? It would
need to ensure further work is not queued after that point, which seems
like it might be tricky to synchronize. I don't know enough about the
workings of RCU to even attempt to implement that.
In any case, it seems like some more analysis is needed based on the
latest data.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> Hi Paul.
>
> Thanks for the quick reply.
>
> On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
>
> >>It is a race between this work running, and the cpu offline processing.
> >
> >One quick way to test this assumption is to build a kernel with Kconfig
> >options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
> >cause call_rcu_sched() to queue the work to a kthread, which can migrate
> >to some other CPU. If your analysis is correct, this should avoid
> >the deadlock. (Note that the deadlock should be fixed in any case,
> >just a diagnostic assumption-check procedure.)
>
> I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to
> do one test run however the issue reproduced, but it took a fair bit
> longer to do so. An initial look at the data indicates that the
> work is still not running. An odd observation, the two threads are
> no longer blocked on the same queue, but different ones.
I was afraid of that...
> Let me look at this more and see what is going on now.
Another thing to try would be to affinity the "rcuo" kthreads to
some CPU that is never taken offline, just in case that kthread is
sometimes somehow getting stuck during the CPU-hotplug operation.
> >>What is the opinion of the domain experts?
> >
> >I do hope that we can come up with a better fix. No offense intended,
> >as coming up with -any- fix in the CPU-hotplug domain is not to be
> >denigrated, but this looks to be at best quite fragile.
> >
> > Thanx, Paul
> >
>
> None taken. I'm not particularly attached to the current fix. I
> agree, it does appear to be quite fragile.
>
> I'm still not sure what a better solution would be though. Maybe
> the RCU framework flushes the work somehow during cpu offline? It
> would need to ensure further work is not queued after that point,
> which seems like it might be tricky to synchronize. I don't know
> enough about the working of RCU to even attempt to implement that.
There are some ways that RCU might be able to shrink the window during
which the outgoing CPU's callbacks are in limbo, but they are not free
of risk, so we really need to completely understand what is going on
before making any possibly ill-conceived changes. ;-)
> In any case, it seems like some more analysis is needed based on the
> latest data.
Looking forward to hearing what you find!
Thanx, Paul
> --
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
>
On Mon, Mar 27, 2017 at 11:17:11AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> > Hi Paul.
> >
> > Thanks for the quick reply.
> >
> > On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> > >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> >
> > >>It is a race between this work running, and the cpu offline processing.
> > >
> > >One quick way to test this assumption is to build a kernel with Kconfig
> > >options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
> > >cause call_rcu_sched() to queue the work to a kthread, which can migrate
> > >to some other CPU. If your analysis is correct, this should avoid
> > >the deadlock. (Note that the deadlock should be fixed in any case,
> > >just a diagnostic assumption-check procedure.)
> >
> > I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> > CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to
> > do one test run however the issue reproduced, but it took a fair bit
> > longer to do so. An initial look at the data indicates that the
> > work is still not running. An odd observation, the two threads are
> > no longer blocked on the same queue, but different ones.
>
> I was afraid of that...
>
> > Let me look at this more and see what is going on now.
>
> Another thing to try would be to affinity the "rcuo" kthreads to
> some CPU that is never taken offline, just in case that kthread is
> sometimes somehow getting stuck during the CPU-hotplug operation.
>
> > >>What is the opinion of the domain experts?
> > >
> > >I do hope that we can come up with a better fix. No offense intended,
> > >as coming up with -any- fix in the CPU-hotplug domain is not to be
> > >denigrated, but this looks to be at best quite fragile.
> > >
> > > Thanx, Paul
> > >
> >
> > None taken. I'm not particularly attached to the current fix. I
> > agree, it does appear to be quite fragile.
> >
> > I'm still not sure what a better solution would be though. Maybe
> > the RCU framework flushes the work somehow during cpu offline? It
> > would need to ensure further work is not queued after that point,
> > which seems like it might be tricky to synchronize. I don't know
> > enough about the working of RCU to even attempt to implement that.
>
> There are some ways that RCU might be able to shrink the window during
> which the outgoing CPU's callbacks are in limbo, but they are not free
> of risk, so we really need to completely understand what is going on
> before making any possibly ill-conceived changes. ;-)
>
> > In any case, it seems like some more analysis is needed based on the
> > latest data.
>
> Looking forward to hearing what you find!
Hearing nothing, I eventually took unilateral action (I am a citizen of
the USA, after all!) and produced the lightly tested patch shown below.
Does it help?
Thanx, Paul
------------------------------------------------------------------------
commit faeb334286b762ba6b829ad62e3d39a97674bb56
Author: Paul E. McKenney <[email protected]>
Date: Tue Jun 20 12:11:34 2017 -0700
rcu: Migrate callbacks earlier in the CPU-offline timeline
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure. This migration makes use
of the cpuhp_report_idle_dead function that is called by the outgoing CPU
just after it has entered the idle loop for the last time. This function
IPIs an online CPU in order to wake up the thread orchestrating the
CPU-hotplug operation, and this commit makes this IPI handler migrate the
immobilized RCU callbacks. Thus, RCU is able to advance these callbacks
and invoke them, which allows all the after-the-fact CPU-hotplug callbacks
to wait on these RCU callbacks without risk of a hang.
Reported-by: Jeffrey Hugo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Richard Weinberger <[email protected]>
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ecca9d2e4e4a..96f1baf62ab8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -109,6 +109,7 @@ void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
void rcu_report_dead(unsigned int cpu);
void rcu_cpu_starting(unsigned int cpu);
+void rcutree_migrate_callbacks(int cpu);
#ifdef CONFIG_RCU_STALL_COMMON
void rcu_sysrq_start(void);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..2f86f2f22a72 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -59,6 +59,7 @@ struct cpuhp_cpu_state {
struct hlist_node *node;
enum cpuhp_state cb_state;
int result;
+ int cpu;
struct completion done;
#endif
};
@@ -736,6 +737,7 @@ static void cpuhp_complete_idle_dead(void *arg)
{
struct cpuhp_cpu_state *st = arg;
+ rcutree_migrate_callbacks(st->cpu);
complete(&st->done);
}
@@ -1828,5 +1830,11 @@ void __init boot_cpu_init(void)
*/
void __init boot_cpu_state_init(void)
{
+ int cpu;
+
per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
+#ifdef CONFIG_SMP
+ for_each_possible_cpu(cpu)
+ per_cpu_ptr(&cpuhp_state, cpu)->cpu = cpu;
+#endif
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 94ec7455fc46..e6d534946c7f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2547,85 +2547,6 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
}
/*
- * Send the specified CPU's RCU callbacks to the orphanage. The
- * specified CPU must be offline, and the caller must hold the
- * ->orphan_lock.
- */
-static void
-rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
- struct rcu_node *rnp, struct rcu_data *rdp)
-{
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs do not have orphanable callbacks. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
- return;
-
- /*
- * Orphan the callbacks. First adjust the counts. This is safe
- * because _rcu_barrier() excludes CPU-hotplug operations, so it
- * cannot be running now. Thus no memory barrier is required.
- */
- rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
- rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * Next, move those callbacks still needing a grace period to
- * the orphanage, where some other CPU will pick them up.
- * Some of the callbacks might have gone partway through a grace
- * period, but that is too bad. They get to start over because we
- * cannot assume that grace periods are synchronized across CPUs.
- */
- rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
-
- /*
- * Then move the ready-to-invoke callbacks to the orphanage,
- * where some other CPU will pick them up. These will not be
- * required to pass though another grace period: They are done.
- */
- rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
-
- /* Finally, disallow further callbacks on this CPU. */
- rcu_segcblist_disable(&rdp->cblist);
-}
-
-/*
- * Adopt the RCU callbacks from the specified rcu_state structure's
- * orphanage. The caller must hold the ->orphan_lock.
- */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
-{
- struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
-
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs are handled specially. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
- rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
- return;
-
- /* Do the accounting first. */
- rdp->n_cbs_adopted += rsp->orphan_done.len;
- if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
- rcu_idle_count_callbacks_posted();
- rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * We do not need a memory barrier here because the only way we
- * can get here if there is an rcu_barrier() in flight is if
- * we are the task doing the rcu_barrier().
- */
-
- /* First adopt the ready-to-invoke callbacks, then the done ones. */
- rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
- WARN_ON_ONCE(rsp->orphan_done.head);
- rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
- WARN_ON_ONCE(rsp->orphan_pend.head);
- WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
- !rcu_segcblist_n_cbs(&rdp->cblist));
-}
-
-/*
* Trace the fact that this CPU is going offline.
*/
static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
@@ -2695,7 +2616,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
*/
static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
{
- unsigned long flags;
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
@@ -2704,18 +2624,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
/* Adjust any no-longer-needed kthreads. */
rcu_boost_kthread_setaffinity(rnp, -1);
-
- /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
- raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
- rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
- rcu_adopt_orphan_cbs(rsp, flags);
- raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
-
- WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
- !rcu_segcblist_empty(&rdp->cblist),
- "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
- cpu, rcu_segcblist_n_cbs(&rdp->cblist),
- rcu_segcblist_first_cb(&rdp->cblist));
}
/*
@@ -3927,6 +3835,116 @@ void rcu_report_dead(unsigned int cpu)
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_idle_cpu(cpu, rsp);
}
+
+/*
+ * Send the specified CPU's RCU callbacks to the orphanage. The
+ * specified CPU must be offline, and the caller must hold the
+ * ->orphan_lock.
+ */
+static void
+rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
+ struct rcu_node *rnp, struct rcu_data *rdp)
+{
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs do not have orphanable callbacks. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
+ return;
+
+ /*
+ * Orphan the callbacks. First adjust the counts. This is safe
+ * because _rcu_barrier() excludes CPU-hotplug operations, so it
+ * cannot be running now. Thus no memory barrier is required.
+ */
+ rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
+ rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * Next, move those callbacks still needing a grace period to
+ * the orphanage, where some other CPU will pick them up.
+ * Some of the callbacks might have gone partway through a grace
+ * period, but that is too bad. They get to start over because we
+ * cannot assume that grace periods are synchronized across CPUs.
+ */
+ rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+
+ /*
+ * Then move the ready-to-invoke callbacks to the orphanage,
+ * where some other CPU will pick them up. These will not be
+ * required to pass though another grace period: They are done.
+ */
+ rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
+
+ /* Finally, disallow further callbacks on this CPU. */
+ rcu_segcblist_disable(&rdp->cblist);
+}
+
+/*
+ * Adopt the RCU callbacks from the specified rcu_state structure's
+ * orphanage. The caller must hold the ->orphan_lock.
+ */
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
+{
+ struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
+
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs are handled specially. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
+ return;
+
+ /* Do the accounting first. */
+ rdp->n_cbs_adopted += rsp->orphan_done.len;
+ if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
+ rcu_idle_count_callbacks_posted();
+ rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * We do not need a memory barrier here because the only way we
+ * can get here if there is an rcu_barrier() in flight is if
+ * we are the task doing the rcu_barrier().
+ */
+
+ /* First adopt the ready-to-invoke callbacks, then the done ones. */
+ rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
+ WARN_ON_ONCE(rsp->orphan_done.head);
+ rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+ WARN_ON_ONCE(rsp->orphan_pend.head);
+ WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
+ !rcu_segcblist_n_cbs(&rdp->cblist));
+}
+
+/* Orphan the dead CPU's callbacks, and then adopt them. */
+static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+
+ raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
+ rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
+ rcu_adopt_orphan_cbs(rsp, flags);
+ raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
+ WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
+ !rcu_segcblist_empty(&rdp->cblist),
+ "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
+ cpu, rcu_segcblist_n_cbs(&rdp->cblist),
+ rcu_segcblist_first_cb(&rdp->cblist));
+}
+
+/*
+ * The outgoing CPU has just passed through the dying-idle state,
+ * and we are being invoked from the CPU that was IPIed to continue the
+ * offline operation. We need to migrate the outgoing CPU's callbacks.
+ */
+void rcutree_migrate_callbacks(int cpu)
+{
+ struct rcu_state *rsp;
+
+ for_each_rcu_flavor(rsp)
+ rcu_migrate_callbacks(cpu, rsp);
+}
#endif
/*
On 6/20/2017 5:46 PM, Paul E. McKenney wrote:
> On Mon, Mar 27, 2017 at 11:17:11AM -0700, Paul E. McKenney wrote:
>> On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
>>> Hi Paul.
>>>
>>> Thanks for the quick reply.
>>>
>>> On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
>>>> On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
>>>
>>>>> It is a race between this work running, and the cpu offline processing.
>>>>
>>>> One quick way to test this assumption is to build a kernel with Kconfig
>>>> options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
>>>> cause call_rcu_sched() to queue the work to a kthread, which can migrate
>>>> to some other CPU. If your analysis is correct, this should avoid
>>>> the deadlock. (Note that the deadlock should be fixed in any case,
>>>> just a diagnostic assumption-check procedure.)
>>>
>>> I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
>>> CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to
>>> do one test run however the issue reproduced, but it took a fair bit
>>> longer to do so. An initial look at the data indicates that the
>>> work is still not running. An odd observation, the two threads are
>>> no longer blocked on the same queue, but different ones.
>>
>> I was afraid of that...
>>
>>> Let me look at this more and see what is going on now.
>>
>> Another thing to try would be to affinity the "rcuo" kthreads to
>> some CPU that is never taken offline, just in case that kthread is
>> sometimes somehow getting stuck during the CPU-hotplug operation.
>>
>>>>> What is the opinion of the domain experts?
>>>>
>>>> I do hope that we can come up with a better fix. No offense intended,
>>>> as coming up with -any- fix in the CPU-hotplug domain is not to be
>>>> denigrated, but this looks to be at best quite fragile.
>>>>
>>>> Thanx, Paul
>>>>
>>>
>>> None taken. I'm not particularly attached to the current fix. I
>>> agree, it does appear to be quite fragile.
>>>
>>> I'm still not sure what a better solution would be though. Maybe
>>> the RCU framework flushes the work somehow during cpu offline? It
>>> would need to ensure further work is not queued after that point,
>>> which seems like it might be tricky to synchronize. I don't know
>>> enough about the working of RCU to even attempt to implement that.
>>
>> There are some ways that RCU might be able to shrink the window during
>> which the outgoing CPU's callbacks are in limbo, but they are not free
>> of risk, so we really need to completely understand what is going on
>> before making any possibly ill-conceived changes. ;-)
>>
>>> In any case, it seems like some more analysis is needed based on the
>>> latest data.
>>
>> Looking forward to hearing what you find!
>
> Hearing nothing, I eventually took unilateral action (I am a citizen of
> USA, after all!) and produced the lightly tested patch shown below.
>
> Does it help?
>
> Thanx, Paul
>
Wow, has it been 3 months already? I am extremely sorry; I've been
preempted multiple times, and this has sat on my todo list, where I keep
thinking I need to find time to come back to it but apparently have not
been doing enough to make that happen.
Thank you for not forgetting about this. I promise I will somehow clear
my schedule to test this next week.
Thank you again.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
On Wed, Jun 21, 2017 at 08:39:45AM -0600, Jeffrey Hugo wrote:
> On 6/20/2017 5:46 PM, Paul E. McKenney wrote:
> >On Mon, Mar 27, 2017 at 11:17:11AM -0700, Paul E. McKenney wrote:
> >>On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> >>>Hi Paul.
> >>>
> >>>Thanks for the quick reply.
> >>>
> >>>On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> >>>>On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> >>>
> >>>>>It is a race between this work running, and the cpu offline processing.
> >>>>
> >>>>One quick way to test this assumption is to build a kernel with Kconfig
> >>>>options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
> >>>>cause call_rcu_sched() to queue the work to a kthread, which can migrate
> >>>>to some other CPU. If your analysis is correct, this should avoid
> >>>>the deadlock. (Note that the deadlock should be fixed in any case,
> >>>>just a diagnostic assumption-check procedure.)
> >>>
> >>>I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> >>>CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to
> >>>do one test run however the issue reproduced, but it took a fair bit
> >>>longer to do so. An initial look at the data indicates that the
> >>>work is still not running. An odd observation, the two threads are
> >>>no longer blocked on the same queue, but different ones.
> >>
> >>I was afraid of that...
> >>
> >>>Let me look at this more and see what is going on now.
> >>
> >>Another thing to try would be to affinity the "rcuo" kthreads to
> >>some CPU that is never taken offline, just in case that kthread is
> >>sometimes somehow getting stuck during the CPU-hotplug operation.
> >>
> >>>>>What is the opinion of the domain experts?
> >>>>
> >>>>I do hope that we can come up with a better fix. No offense intended,
> >>>>as coming up with -any- fix in the CPU-hotplug domain is not to be
> >>>>denigrated, but this looks to be at best quite fragile.
> >>>>
> >>>> Thanx, Paul
> >>>>
> >>>
> >>>None taken. I'm not particularly attached to the current fix. I
> >>>agree, it does appear to be quite fragile.
> >>>
> >>>I'm still not sure what a better solution would be though. Maybe
> >>>the RCU framework flushes the work somehow during cpu offline? It
> >>>would need to ensure further work is not queued after that point,
> >>>which seems like it might be tricky to synchronize. I don't know
> >>>enough about the working of RCU to even attempt to implement that.
> >>
> >>There are some ways that RCU might be able to shrink the window during
> >>which the outgoing CPU's callbacks are in limbo, but they are not free
> >>of risk, so we really need to completely understand what is going on
> >>before making any possibly ill-conceived changes. ;-)
> >>
> >>>In any case, it seems like some more analysis is needed based on the
> >>>latest data.
> >>
> >>Looking forward to hearing what you find!
> >
> >Hearing nothing, I eventually took unilateral action (I am a citizen of
> >USA, after all!) and produced the lightly tested patch shown below.
> >
> >Does it help?
> >
> > Thanx, Paul
> >
>
> Wow, has it been 3 months already? I am extremely sorry, I've been
> preempted multiple times, and this has sat on my todo list where I
> keep thinking I need to find time to come back to it but apparently
> not doing enough to make that happen.
>
> Thank you for not forgetting about this. I promise I will somehow
> clear my schedule to test this next week.
No worries, and I am very much looking forward to seeing the results of
your testing.
Thanx, Paul
On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 21, 2017 at 08:39:45AM -0600, Jeffrey Hugo wrote:
> > On 6/20/2017 5:46 PM, Paul E. McKenney wrote:
> > >On Mon, Mar 27, 2017 at 11:17:11AM -0700, Paul E. McKenney wrote:
> > >>On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> > >>>Hi Paul.
> > >>>
> > >>>Thanks for the quick reply.
> > >>>
> > >>>On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> > >>>>On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> > >>>
> > >>>>>It is a race between this work running, and the cpu offline processing.
> > >>>>
> > >>>>One quick way to test this assumption is to build a kernel with Kconfig
> > >>>>options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will
> > >>>>cause call_rcu_sched() to queue the work to a kthread, which can migrate
> > >>>>to some other CPU. If your analysis is correct, this should avoid
> > >>>>the deadlock. (Note that the deadlock should be fixed in any case,
> > >>>>just a diagnostic assumption-check procedure.)
> > >>>
> > >>>I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> > >>>CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to
> > >>>do one test run however the issue reproduced, but it took a fair bit
> > >>>longer to do so. An initial look at the data indicates that the
> > >>>work is still not running. An odd observation, the two threads are
> > >>>no longer blocked on the same queue, but different ones.
> > >>
> > >>I was afraid of that...
> > >>
> > >>>Let me look at this more and see what is going on now.
> > >>
> > >>Another thing to try would be to affinity the "rcuo" kthreads to
> > >>some CPU that is never taken offline, just in case that kthread is
> > >>sometimes somehow getting stuck during the CPU-hotplug operation.
> > >>
> > >>>>>What is the opinion of the domain experts?
> > >>>>
> > >>>>I do hope that we can come up with a better fix. No offense intended,
> > >>>>as coming up with -any- fix in the CPU-hotplug domain is not to be
> > >>>>denigrated, but this looks to be at best quite fragile.
> > >>>>
> > >>>> Thanx, Paul
> > >>>>
> > >>>
> > >>>None taken. I'm not particularly attached to the current fix. I
> > >>>agree, it does appear to be quite fragile.
> > >>>
> > >>>I'm still not sure what a better solution would be though. Maybe
> > >>>the RCU framework flushes the work somehow during cpu offline? It
> > >>>would need to ensure further work is not queued after that point,
> > >>>which seems like it might be tricky to synchronize. I don't know
> > >>>enough about the working of RCU to even attempt to implement that.
> > >>
> > >>There are some ways that RCU might be able to shrink the window during
> > >>which the outgoing CPU's callbacks are in limbo, but they are not free
> > >>of risk, so we really need to completely understand what is going on
> > >>before making any possibly ill-conceived changes. ;-)
> > >>
> > >>>In any case, it seems like some more analysis is needed based on the
> > >>>latest data.
> > >>
> > >>Looking forward to hearing what you find!
> > >
> > >Hearing nothing, I eventually took unilateral action (I am a citizen of
> > >USA, after all!) and produced the lightly tested patch shown below.
> > >
> > >Does it help?
> > >
> > > Thanx, Paul
> > >
> >
> > Wow, has it been 3 months already? I am extremely sorry, I've been
> > preempted multiple times, and this has sat on my todo list where I
> > keep thinking I need to find time to come back to it but apparently
> > not doing enough to make that happen.
> >
> > Thank you for not forgetting about this. I promise I will somehow
> > clear my schedule to test this next week.
>
> No worries, and I am very much looking forward to seeing the results of
> your testing.
And please see below for an updated patch based on LKML review and
more intensive testing.
Thanx, Paul
------------------------------------------------------------------------
commit 4384c26f62d90e9685a50d65dcd352dbe96ae220
Author: Paul E. McKenney <[email protected]>
Date: Tue Jun 20 12:11:34 2017 -0700
rcu: Migrate callbacks earlier in the CPU-offline timeline
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.
While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.
Reported-by: Jeffrey Hugo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Richard Weinberger <[email protected]>
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ecca9d2e4e4a..96f1baf62ab8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -109,6 +109,7 @@ void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
void rcu_report_dead(unsigned int cpu);
void rcu_cpu_starting(unsigned int cpu);
+void rcutree_migrate_callbacks(int cpu);
#ifdef CONFIG_RCU_STALL_COMMON
void rcu_sysrq_start(void);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..f5f3db2403fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -729,6 +729,7 @@ static int takedown_cpu(unsigned int cpu)
__cpu_die(cpu);
tick_cleanup_dead_cpu(cpu);
+ rcutree_migrate_callbacks(cpu);
return 0;
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 94ec7455fc46..e6d534946c7f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2547,85 +2547,6 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
}
/*
- * Send the specified CPU's RCU callbacks to the orphanage. The
- * specified CPU must be offline, and the caller must hold the
- * ->orphan_lock.
- */
-static void
-rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
- struct rcu_node *rnp, struct rcu_data *rdp)
-{
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs do not have orphanable callbacks. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
- return;
-
- /*
- * Orphan the callbacks. First adjust the counts. This is safe
- * because _rcu_barrier() excludes CPU-hotplug operations, so it
- * cannot be running now. Thus no memory barrier is required.
- */
- rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
- rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * Next, move those callbacks still needing a grace period to
- * the orphanage, where some other CPU will pick them up.
- * Some of the callbacks might have gone partway through a grace
- * period, but that is too bad. They get to start over because we
- * cannot assume that grace periods are synchronized across CPUs.
- */
- rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
-
- /*
- * Then move the ready-to-invoke callbacks to the orphanage,
- * where some other CPU will pick them up. These will not be
- * required to pass though another grace period: They are done.
- */
- rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
-
- /* Finally, disallow further callbacks on this CPU. */
- rcu_segcblist_disable(&rdp->cblist);
-}
-
-/*
- * Adopt the RCU callbacks from the specified rcu_state structure's
- * orphanage. The caller must hold the ->orphan_lock.
- */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
-{
- struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
-
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs are handled specially. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
- rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
- return;
-
- /* Do the accounting first. */
- rdp->n_cbs_adopted += rsp->orphan_done.len;
- if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
- rcu_idle_count_callbacks_posted();
- rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * We do not need a memory barrier here because the only way we
- * can get here if there is an rcu_barrier() in flight is if
- * we are the task doing the rcu_barrier().
- */
-
- /* First adopt the ready-to-invoke callbacks, then the done ones. */
- rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
- WARN_ON_ONCE(rsp->orphan_done.head);
- rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
- WARN_ON_ONCE(rsp->orphan_pend.head);
- WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
- !rcu_segcblist_n_cbs(&rdp->cblist));
-}
-
-/*
* Trace the fact that this CPU is going offline.
*/
static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
@@ -2695,7 +2616,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
*/
static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
{
- unsigned long flags;
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
@@ -2704,18 +2624,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
/* Adjust any no-longer-needed kthreads. */
rcu_boost_kthread_setaffinity(rnp, -1);
-
- /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
- raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
- rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
- rcu_adopt_orphan_cbs(rsp, flags);
- raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
-
- WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
- !rcu_segcblist_empty(&rdp->cblist),
- "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
- cpu, rcu_segcblist_n_cbs(&rdp->cblist),
- rcu_segcblist_first_cb(&rdp->cblist));
}
/*
@@ -3927,6 +3835,116 @@ void rcu_report_dead(unsigned int cpu)
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_idle_cpu(cpu, rsp);
}
+
+/*
+ * Send the specified CPU's RCU callbacks to the orphanage. The
+ * specified CPU must be offline, and the caller must hold the
+ * ->orphan_lock.
+ */
+static void
+rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
+ struct rcu_node *rnp, struct rcu_data *rdp)
+{
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs do not have orphanable callbacks. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
+ return;
+
+ /*
+ * Orphan the callbacks. First adjust the counts. This is safe
+ * because _rcu_barrier() excludes CPU-hotplug operations, so it
+ * cannot be running now. Thus no memory barrier is required.
+ */
+ rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
+ rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * Next, move those callbacks still needing a grace period to
+ * the orphanage, where some other CPU will pick them up.
+ * Some of the callbacks might have gone partway through a grace
+ * period, but that is too bad. They get to start over because we
+ * cannot assume that grace periods are synchronized across CPUs.
+ */
+ rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+
+ /*
+ * Then move the ready-to-invoke callbacks to the orphanage,
+ * where some other CPU will pick them up. These will not be
+ * required to pass though another grace period: They are done.
+ */
+ rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
+
+ /* Finally, disallow further callbacks on this CPU. */
+ rcu_segcblist_disable(&rdp->cblist);
+}
+
+/*
+ * Adopt the RCU callbacks from the specified rcu_state structure's
+ * orphanage. The caller must hold the ->orphan_lock.
+ */
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
+{
+ struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
+
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs are handled specially. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
+ return;
+
+ /* Do the accounting first. */
+ rdp->n_cbs_adopted += rsp->orphan_done.len;
+ if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
+ rcu_idle_count_callbacks_posted();
+ rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * We do not need a memory barrier here because the only way we
+ * can get here if there is an rcu_barrier() in flight is if
+ * we are the task doing the rcu_barrier().
+ */
+
+ /* First adopt the ready-to-invoke callbacks, then the done ones. */
+ rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
+ WARN_ON_ONCE(rsp->orphan_done.head);
+ rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+ WARN_ON_ONCE(rsp->orphan_pend.head);
+ WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
+ !rcu_segcblist_n_cbs(&rdp->cblist));
+}
+
+/* Orphan the dead CPU's callbacks, and then adopt them. */
+static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+
+ raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
+ rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
+ rcu_adopt_orphan_cbs(rsp, flags);
+ raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
+ WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
+ !rcu_segcblist_empty(&rdp->cblist),
+ "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
+ cpu, rcu_segcblist_n_cbs(&rdp->cblist),
+ rcu_segcblist_first_cb(&rdp->cblist));
+}
+
+/*
+ * The outgoing CPU has just passed through the dying-idle state,
+ * and we are being invoked from the CPU that was IPIed to continue the
+ * offline operation. We need to migrate the outgoing CPU's callbacks.
+ */
+void rcutree_migrate_callbacks(int cpu)
+{
+ struct rcu_state *rsp;
+
+ for_each_rcu_flavor(rsp)
+ rcu_migrate_callbacks(cpu, rsp);
+}
#endif
/*
On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
> On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
>> No worries, and I am very much looking forward to seeing the results of
>> your testing.
>
> And please see below for an updated patch based on LKML review and
> more intensive testing.
>
I spent some time on this today. It didn't go as I expected. I
validated that the issue is reproducible as before on 4.11 and 4.12 rcs
1 through 4. However, the version of stress-ng that I was using ran into
constant errors starting with rc5, making it nearly impossible to make
progress toward reproduction. Upgrading stress-ng to tip fixes those
errors; however, I've still been unable to repro the deadlock on rc5.
It's my unfounded suspicion that something went in between rc4 and rc5
which changed the timing, rather than actually fixing the issue. I will
run the test overnight for 5 hours to try to repro.
The patch you sent appears to be based on linux-next, and appears to
have a number of dependencies which prevent it from cleanly applying to
anything current that I'm able to repro on at this time. Do you want to
provide a rebased version of the patch which applies to, say, 4.11? I
could easily test that and report back.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
> On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
> >On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
> >>No worries, and I am very much looking forward to seeing the results of
> >>your testing.
> >
> >And please see below for an updated patch based on LKML review and
> >more intensive testing.
> >
>
> I spent some time on this today. It didn't go as I expected. I
> validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
> through 4. However, the version of stress-ng that I was using ran
> into constant errors starting with rc5, making it nearly impossible
> to make progress toward reproduction. Upgrading stress-ng to tip
> fixes the issue, however, I've still been unable to repro the issue.
>
> Its my unfounded suspicion that something went in between rc4 and
> rc5 which changed the timing, and didn't actually fix the issue. I
> will run the test overnight for 5 hours to try to repro.
>
> The patch you sent appears to be based on linux-next, and appears to
> have a number of dependencies which prevent it from cleanly applying
> on anything current that I'm able to repro on at this time. Do you
> want to provide a rebased version of the patch which applies to say
> 4.11? I could easily test that and report back.
Here is a very lightly tested backport to v4.11.
Thanx, Paul
------------------------------------------------------------------------
commit 5557f5d8435dcfdc81ff090aceb57734448d3051
Author: Paul E. McKenney <[email protected]>
Date: Tue Jun 20 12:11:34 2017 -0700
rcu: Migrate callbacks earlier in the CPU-offline timeline
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.
While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.
Reported-by: Jeffrey Hugo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Richard Weinberger <[email protected]>
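To make the hang mode described in the changelog concrete, here is a
purely hypothetical illustration (not part of this patch; all the
example_* names are made up): a dead-state CPU-hotplug callback that
waits, via a completion, for an RCU callback which was queued on the CPU
that just went offline. Before this change, such callbacks sat
immobilized until rcu_cleanup_dead_cpu() ran near the end of the hotplug
sequence, so a wait like this one could never be satisfied.
#include <linux/completion.h>
#include <linux/rcupdate.h>
static DECLARE_COMPLETION(example_done);
/* Assume this was queued with call_rcu() on the CPU that is going down. */
static void example_rcu_cb(struct rcu_head *rhp)
{
	complete(&example_done);	/* only runs once the callback is invoked */
}
/* A hypothetical CPUHP "dead" teardown callback. */
static int example_cpuhp_dead(unsigned int cpu)
{
	/*
	 * If example_rcu_cb() is still stranded on the dead CPU's callback
	 * list, this wait never completes and the hotplug operation hangs.
	 */
	wait_for_completion(&example_done);
	return 0;
}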
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de88b33c0974..183d69438776 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -295,6 +295,7 @@ void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
void rcu_report_dead(unsigned int cpu);
void rcu_cpu_starting(unsigned int cpu);
+void rcutree_migrate_callbacks(int cpu);
#ifndef CONFIG_TINY_RCU
void rcu_end_inkernel_boot(void);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 37b223e4fc05..21be6ab54ea2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -729,6 +729,7 @@ static int takedown_cpu(unsigned int cpu)
__cpu_die(cpu);
tick_cleanup_dead_cpu(cpu);
+ rcutree_migrate_callbacks(cpu);
return 0;
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 50fee7689e71..63206f81574a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2636,114 +2636,6 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
}
/*
- * Send the specified CPU's RCU callbacks to the orphanage. The
- * specified CPU must be offline, and the caller must hold the
- * ->orphan_lock.
- */
-static void
-rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
- struct rcu_node *rnp, struct rcu_data *rdp)
-{
- /* No-CBs CPUs do not have orphanable callbacks. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
- return;
-
- /*
- * Orphan the callbacks. First adjust the counts. This is safe
- * because _rcu_barrier() excludes CPU-hotplug operations, so it
- * cannot be running now. Thus no memory barrier is required.
- */
- if (rdp->nxtlist != NULL) {
- rsp->qlen_lazy += rdp->qlen_lazy;
- rsp->qlen += rdp->qlen;
- rdp->n_cbs_orphaned += rdp->qlen;
- rdp->qlen_lazy = 0;
- WRITE_ONCE(rdp->qlen, 0);
- }
-
- /*
- * Next, move those callbacks still needing a grace period to
- * the orphanage, where some other CPU will pick them up.
- * Some of the callbacks might have gone partway through a grace
- * period, but that is too bad. They get to start over because we
- * cannot assume that grace periods are synchronized across CPUs.
- * We don't bother updating the ->nxttail[] array yet, instead
- * we just reset the whole thing later on.
- */
- if (*rdp->nxttail[RCU_DONE_TAIL] != NULL) {
- *rsp->orphan_nxttail = *rdp->nxttail[RCU_DONE_TAIL];
- rsp->orphan_nxttail = rdp->nxttail[RCU_NEXT_TAIL];
- *rdp->nxttail[RCU_DONE_TAIL] = NULL;
- }
-
- /*
- * Then move the ready-to-invoke callbacks to the orphanage,
- * where some other CPU will pick them up. These will not be
- * required to pass though another grace period: They are done.
- */
- if (rdp->nxtlist != NULL) {
- *rsp->orphan_donetail = rdp->nxtlist;
- rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL];
- }
-
- /*
- * Finally, initialize the rcu_data structure's list to empty and
- * disallow further callbacks on this CPU.
- */
- init_callback_list(rdp);
- rdp->nxttail[RCU_NEXT_TAIL] = NULL;
-}
-
-/*
- * Adopt the RCU callbacks from the specified rcu_state structure's
- * orphanage. The caller must hold the ->orphan_lock.
- */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
-{
- int i;
- struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
-
- /* No-CBs CPUs are handled specially. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
- rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
- return;
-
- /* Do the accounting first. */
- rdp->qlen_lazy += rsp->qlen_lazy;
- rdp->qlen += rsp->qlen;
- rdp->n_cbs_adopted += rsp->qlen;
- if (rsp->qlen_lazy != rsp->qlen)
- rcu_idle_count_callbacks_posted();
- rsp->qlen_lazy = 0;
- rsp->qlen = 0;
-
- /*
- * We do not need a memory barrier here because the only way we
- * can get here if there is an rcu_barrier() in flight is if
- * we are the task doing the rcu_barrier().
- */
-
- /* First adopt the ready-to-invoke callbacks. */
- if (rsp->orphan_donelist != NULL) {
- *rsp->orphan_donetail = *rdp->nxttail[RCU_DONE_TAIL];
- *rdp->nxttail[RCU_DONE_TAIL] = rsp->orphan_donelist;
- for (i = RCU_NEXT_SIZE - 1; i >= RCU_DONE_TAIL; i--)
- if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL])
- rdp->nxttail[i] = rsp->orphan_donetail;
- rsp->orphan_donelist = NULL;
- rsp->orphan_donetail = &rsp->orphan_donelist;
- }
-
- /* And then adopt the callbacks that still need a grace period. */
- if (rsp->orphan_nxtlist != NULL) {
- *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxtlist;
- rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxttail;
- rsp->orphan_nxtlist = NULL;
- rsp->orphan_nxttail = &rsp->orphan_nxtlist;
- }
-}
-
-/*
* Trace the fact that this CPU is going offline.
*/
static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
@@ -2805,14 +2697,12 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
/*
* The CPU has been completely removed, and some other CPU is reporting
- * this fact from process context. Do the remainder of the cleanup,
- * including orphaning the outgoing CPU's RCU callbacks, and also
- * adopting them. There can only be one CPU hotplug operation at a time,
- * so no other CPU can be attempting to update rcu_cpu_kthread_task.
+ * this fact from process context. Do the remainder of the cleanup.
+ * There can only be one CPU hotplug operation at a time, so no need for
+ * explicit locking.
*/
static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
{
- unsigned long flags;
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
@@ -2821,16 +2711,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
/* Adjust any no-longer-needed kthreads. */
rcu_boost_kthread_setaffinity(rnp, -1);
-
- /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
- raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
- rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
- rcu_adopt_orphan_cbs(rsp, flags);
- raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
-
- WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
- "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
- cpu, rdp->qlen, rdp->nxtlist);
}
/*
@@ -4011,6 +3891,140 @@ void rcu_report_dead(unsigned int cpu)
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_idle_cpu(cpu, rsp);
}
+
+/*
+ * Send the specified CPU's RCU callbacks to the orphanage. The
+ * specified CPU must be offline, and the caller must hold the
+ * ->orphan_lock.
+ */
+static void
+rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
+ struct rcu_node *rnp, struct rcu_data *rdp)
+{
+ /* No-CBs CPUs do not have orphanable callbacks. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
+ return;
+
+ /*
+ * Orphan the callbacks. First adjust the counts. This is safe
+ * because _rcu_barrier() excludes CPU-hotplug operations, so it
+ * cannot be running now. Thus no memory barrier is required.
+ */
+ if (rdp->nxtlist != NULL) {
+ rsp->qlen_lazy += rdp->qlen_lazy;
+ rsp->qlen += rdp->qlen;
+ rdp->n_cbs_orphaned += rdp->qlen;
+ rdp->qlen_lazy = 0;
+ WRITE_ONCE(rdp->qlen, 0);
+ }
+
+ /*
+ * Next, move those callbacks still needing a grace period to
+ * the orphanage, where some other CPU will pick them up.
+ * Some of the callbacks might have gone partway through a grace
+ * period, but that is too bad. They get to start over because we
+ * cannot assume that grace periods are synchronized across CPUs.
+ * We don't bother updating the ->nxttail[] array yet, instead
+ * we just reset the whole thing later on.
+ */
+ if (*rdp->nxttail[RCU_DONE_TAIL] != NULL) {
+ *rsp->orphan_nxttail = *rdp->nxttail[RCU_DONE_TAIL];
+ rsp->orphan_nxttail = rdp->nxttail[RCU_NEXT_TAIL];
+ *rdp->nxttail[RCU_DONE_TAIL] = NULL;
+ }
+
+ /*
+ * Then move the ready-to-invoke callbacks to the orphanage,
+ * where some other CPU will pick them up. These will not be
+ * required to pass though another grace period: They are done.
+ */
+ if (rdp->nxtlist != NULL) {
+ *rsp->orphan_donetail = rdp->nxtlist;
+ rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL];
+ }
+
+ /*
+ * Finally, initialize the rcu_data structure's list to empty and
+ * disallow further callbacks on this CPU.
+ */
+ init_callback_list(rdp);
+ rdp->nxttail[RCU_NEXT_TAIL] = NULL;
+}
+
+/*
+ * Adopt the RCU callbacks from the specified rcu_state structure's
+ * orphanage. The caller must hold the ->orphan_lock.
+ */
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
+{
+ int i;
+ struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
+
+ /* No-CBs CPUs are handled specially. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
+ return;
+
+ /* Do the accounting first. */
+ rdp->qlen_lazy += rsp->qlen_lazy;
+ rdp->qlen += rsp->qlen;
+ rdp->n_cbs_adopted += rsp->qlen;
+ if (rsp->qlen_lazy != rsp->qlen)
+ rcu_idle_count_callbacks_posted();
+ rsp->qlen_lazy = 0;
+ rsp->qlen = 0;
+
+ /*
+ * We do not need a memory barrier here because the only way we
+ * can get here if there is an rcu_barrier() in flight is if
+ * we are the task doing the rcu_barrier().
+ */
+
+ /* First adopt the ready-to-invoke callbacks. */
+ if (rsp->orphan_donelist != NULL) {
+ *rsp->orphan_donetail = *rdp->nxttail[RCU_DONE_TAIL];
+ *rdp->nxttail[RCU_DONE_TAIL] = rsp->orphan_donelist;
+ for (i = RCU_NEXT_SIZE - 1; i >= RCU_DONE_TAIL; i--)
+ if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL])
+ rdp->nxttail[i] = rsp->orphan_donetail;
+ rsp->orphan_donelist = NULL;
+ rsp->orphan_donetail = &rsp->orphan_donelist;
+ }
+
+ /* And then adopt the callbacks that still need a grace period. */
+ if (rsp->orphan_nxtlist != NULL) {
+ *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxtlist;
+ rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxttail;
+ rsp->orphan_nxtlist = NULL;
+ rsp->orphan_nxttail = &rsp->orphan_nxtlist;
+ }
+}
+
+/* Orphan the dead CPU's callbacks, and then adopt them. */
+static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+
+ raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
+ rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
+ rcu_adopt_orphan_cbs(rsp, flags);
+ raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
+}
+
+/*
+ * The outgoing CPU has just passed through the dying-idle state,
+ * and we are being invoked from the CPU that was IPIed to continue the
+ * offline operation. We need to migrate the outgoing CPU's callbacks.
+ */
+void rcutree_migrate_callbacks(int cpu)
+{
+ struct rcu_state *rsp;
+
+ for_each_rcu_flavor(rsp)
+ rcu_migrate_callbacks(cpu, rsp);
+}
#endif
static int rcu_pm_notify(struct notifier_block *self,
On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
> On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
>> On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
>>> On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
>>>> No worries, and I am very much looking forward to seeing the results of
>>>> your testing.
>>>
>>> And please see below for an updated patch based on LKML review and
>>> more intensive testing.
>>>
>>
>> I spent some time on this today. It didn't go as I expected. I
>> validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
>> through 4. However, the version of stress-ng that I was using ran
>> into constant errors starting with rc5, making it nearly impossible
>> to make progress toward reproduction. Upgrading stress-ng to tip
>> fixes the issue, however, I've still been unable to repro the issue.
>>
>> Its my unfounded suspicion that something went in between rc4 and
>> rc5 which changed the timing, and didn't actually fix the issue. I
>> will run the test overnight for 5 hours to try to repro.
>>
>> The patch you sent appears to be based on linux-next, and appears to
>> have a number of dependencies which prevent it from cleanly applying
>> on anything current that I'm able to repro on at this time. Do you
>> want to provide a rebased version of the patch which applies to say
>> 4.11? I could easily test that and report back.
>
> Here is a very lightly tested backport to v4.11.
>
Works for me. Always reproduced the lockup within 2 minutes on stock
4.11. With the change applied, I was able to test for 2 hours in the
same conditions, and for 4 hours with the full system, without
encountering an issue.
Feel free to add:
Tested-by: Jeffrey Hugo <[email protected]>
I'm going to go back to 4.12-rc5 and see if I can either repro the
issue or identify what changed. Hopefully I can get to linux-next and
double check the original version of the change as well.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
> On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
> >On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
> >>On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
> >>>On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
> >>>>No worries, and I am very much looking forward to seeing the results of
> >>>>your testing.
> >>>
> >>>And please see below for an updated patch based on LKML review and
> >>>more intensive testing.
> >>>
> >>
> >>I spent some time on this today. It didn't go as I expected. I
> >>validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
> >>through 4. However, the version of stress-ng that I was using ran
> >>into constant errors starting with rc5, making it nearly impossible
> >>to make progress toward reproduction. Upgrading stress-ng to tip
> >>fixes the issue, however, I've still been unable to repro the issue.
> >>
> >>Its my unfounded suspicion that something went in between rc4 and
> >>rc5 which changed the timing, and didn't actually fix the issue. I
> >>will run the test overnight for 5 hours to try to repro.
> >>
> >>The patch you sent appears to be based on linux-next, and appears to
> >>have a number of dependencies which prevent it from cleanly applying
> >>on anything current that I'm able to repro on at this time. Do you
> >>want to provide a rebased version of the patch which applies to say
> >>4.11? I could easily test that and report back.
> >
> >Here is a very lightly tested backport to v4.11.
> >
>
> Works for me. Always reproduced the lockup within 2 minutes on stock
> 4.11. With the change applied, I was able to test for 2 hours in
> the same conditions, and 4 hours with the full system and not
> encounter an issue.
>
> Feel free to add:
> Tested-by: Jeffrey Hugo <[email protected]>
Applied, thank you!
> I'm going to go back to 4.12-rc5 and see if I can get either repro
> the issue, or identify what changed. Hopefully I can get to
> linux-next and double check the original version of the change as
> well.
Looking forward to hearing what you find!
Thanx, Paul
Commit-ID: a58163d8ca2c8d288ee9f95989712f98473a5ac2
Gitweb: http://git.kernel.org/tip/a58163d8ca2c8d288ee9f95989712f98473a5ac2
Author: Paul E. McKenney <[email protected]>
AuthorDate: Tue, 20 Jun 2017 12:11:34 -0700
Committer: Paul E. McKenney <[email protected]>
CommitDate: Tue, 25 Jul 2017 13:03:43 -0700
rcu: Migrate callbacks earlier in the CPU-offline timeline
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.
While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.
Reported-by: Jeffrey Hugo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Richard Weinberger <[email protected]>
---
include/linux/rcupdate.h | 1 +
kernel/cpu.c | 1 +
kernel/rcu/tree.c | 209 +++++++++++++++++++++++++----------------------
3 files changed, 115 insertions(+), 96 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index f816fc7..cf307eb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -110,6 +110,7 @@ void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
void rcu_report_dead(unsigned int cpu);
void rcu_cpu_starting(unsigned int cpu);
+void rcutree_migrate_callbacks(int cpu);
#ifdef CONFIG_RCU_STALL_COMMON
void rcu_sysrq_start(void);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index eee0331..bfbd649 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -650,6 +650,7 @@ static int takedown_cpu(unsigned int cpu)
__cpu_die(cpu);
tick_cleanup_dead_cpu(cpu);
+ rcutree_migrate_callbacks(cpu);
return 0;
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51d4c3a..9bb5dff 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2563,85 +2563,6 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
}
/*
- * Send the specified CPU's RCU callbacks to the orphanage. The
- * specified CPU must be offline, and the caller must hold the
- * ->orphan_lock.
- */
-static void
-rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
- struct rcu_node *rnp, struct rcu_data *rdp)
-{
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs do not have orphanable callbacks. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
- return;
-
- /*
- * Orphan the callbacks. First adjust the counts. This is safe
- * because _rcu_barrier() excludes CPU-hotplug operations, so it
- * cannot be running now. Thus no memory barrier is required.
- */
- rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
- rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * Next, move those callbacks still needing a grace period to
- * the orphanage, where some other CPU will pick them up.
- * Some of the callbacks might have gone partway through a grace
- * period, but that is too bad. They get to start over because we
- * cannot assume that grace periods are synchronized across CPUs.
- */
- rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
-
- /*
- * Then move the ready-to-invoke callbacks to the orphanage,
- * where some other CPU will pick them up. These will not be
- * required to pass though another grace period: They are done.
- */
- rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
-
- /* Finally, disallow further callbacks on this CPU. */
- rcu_segcblist_disable(&rdp->cblist);
-}
-
-/*
- * Adopt the RCU callbacks from the specified rcu_state structure's
- * orphanage. The caller must hold the ->orphan_lock.
- */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
-{
- struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
-
- lockdep_assert_held(&rsp->orphan_lock);
-
- /* No-CBs CPUs are handled specially. */
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
- rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
- return;
-
- /* Do the accounting first. */
- rdp->n_cbs_adopted += rsp->orphan_done.len;
- if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
- rcu_idle_count_callbacks_posted();
- rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
-
- /*
- * We do not need a memory barrier here because the only way we
- * can get here if there is an rcu_barrier() in flight is if
- * we are the task doing the rcu_barrier().
- */
-
- /* First adopt the ready-to-invoke callbacks, then the done ones. */
- rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
- WARN_ON_ONCE(rsp->orphan_done.head);
- rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
- WARN_ON_ONCE(rsp->orphan_pend.head);
- WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
- !rcu_segcblist_n_cbs(&rdp->cblist));
-}
-
-/*
* Trace the fact that this CPU is going offline.
*/
static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
@@ -2704,14 +2625,12 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
/*
* The CPU has been completely removed, and some other CPU is reporting
- * this fact from process context. Do the remainder of the cleanup,
- * including orphaning the outgoing CPU's RCU callbacks, and also
- * adopting them. There can only be one CPU hotplug operation at a time,
- * so no other CPU can be attempting to update rcu_cpu_kthread_task.
+ * this fact from process context. Do the remainder of the cleanup.
+ * There can only be one CPU hotplug operation at a time, so no need for
+ * explicit locking.
*/
static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
{
- unsigned long flags;
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
@@ -2720,18 +2639,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
/* Adjust any no-longer-needed kthreads. */
rcu_boost_kthread_setaffinity(rnp, -1);
-
- /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
- raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
- rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
- rcu_adopt_orphan_cbs(rsp, flags);
- raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
-
- WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
- !rcu_segcblist_empty(&rdp->cblist),
- "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
- cpu, rcu_segcblist_n_cbs(&rdp->cblist),
- rcu_segcblist_first_cb(&rdp->cblist));
}
/*
@@ -3937,6 +3844,116 @@ void rcu_report_dead(unsigned int cpu)
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_idle_cpu(cpu, rsp);
}
+
+/*
+ * Send the specified CPU's RCU callbacks to the orphanage. The
+ * specified CPU must be offline, and the caller must hold the
+ * ->orphan_lock.
+ */
+static void
+rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
+ struct rcu_node *rnp, struct rcu_data *rdp)
+{
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs do not have orphanable callbacks. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
+ return;
+
+ /*
+ * Orphan the callbacks. First adjust the counts. This is safe
+ * because _rcu_barrier() excludes CPU-hotplug operations, so it
+ * cannot be running now. Thus no memory barrier is required.
+ */
+ rdp->n_cbs_orphaned += rcu_segcblist_n_cbs(&rdp->cblist);
+ rcu_segcblist_extract_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * Next, move those callbacks still needing a grace period to
+ * the orphanage, where some other CPU will pick them up.
+ * Some of the callbacks might have gone partway through a grace
+ * period, but that is too bad. They get to start over because we
+ * cannot assume that grace periods are synchronized across CPUs.
+ */
+ rcu_segcblist_extract_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+
+ /*
+ * Then move the ready-to-invoke callbacks to the orphanage,
+ * where some other CPU will pick them up. These will not be
+ * required to pass though another grace period: They are done.
+ */
+ rcu_segcblist_extract_done_cbs(&rdp->cblist, &rsp->orphan_done);
+
+ /* Finally, disallow further callbacks on this CPU. */
+ rcu_segcblist_disable(&rdp->cblist);
+}
+
+/*
+ * Adopt the RCU callbacks from the specified rcu_state structure's
+ * orphanage. The caller must hold the ->orphan_lock.
+ */
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
+{
+ struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
+
+ lockdep_assert_held(&rsp->orphan_lock);
+
+ /* No-CBs CPUs are handled specially. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
+ return;
+
+ /* Do the accounting first. */
+ rdp->n_cbs_adopted += rsp->orphan_done.len;
+ if (rsp->orphan_done.len_lazy != rsp->orphan_done.len)
+ rcu_idle_count_callbacks_posted();
+ rcu_segcblist_insert_count(&rdp->cblist, &rsp->orphan_done);
+
+ /*
+ * We do not need a memory barrier here because the only way we
+ * can get here if there is an rcu_barrier() in flight is if
+ * we are the task doing the rcu_barrier().
+ */
+
+ /* First adopt the ready-to-invoke callbacks, then the done ones. */
+ rcu_segcblist_insert_done_cbs(&rdp->cblist, &rsp->orphan_done);
+ WARN_ON_ONCE(rsp->orphan_done.head);
+ rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rsp->orphan_pend);
+ WARN_ON_ONCE(rsp->orphan_pend.head);
+ WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) !=
+ !rcu_segcblist_n_cbs(&rdp->cblist));
+}
+
+/* Orphan the dead CPU's callbacks, and then adopt them. */
+static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+
+ raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
+ rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
+ rcu_adopt_orphan_cbs(rsp, flags);
+ raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
+ WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
+ !rcu_segcblist_empty(&rdp->cblist),
+ "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
+ cpu, rcu_segcblist_n_cbs(&rdp->cblist),
+ rcu_segcblist_first_cb(&rdp->cblist));
+}
+
+/*
+ * The outgoing CPU has just passed through the dying-idle state,
+ * and we are being invoked from the CPU that was IPIed to continue the
+ * offline operation. We need to migrate the outgoing CPU's callbacks.
+ */
+void rcutree_migrate_callbacks(int cpu)
+{
+ struct rcu_state *rsp;
+
+ for_each_rcu_flavor(rsp)
+ rcu_migrate_callbacks(cpu, rsp);
+}
#endif
/*
On 6/29/2017 6:18 PM, Paul E. McKenney wrote:
> On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
>> On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
>>> On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
>>>> On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
>>>>> On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
>>>>>> No worries, and I am very much looking forward to seeing the results of
>>>>>> your testing.
>>>>>
>>>>> And please see below for an updated patch based on LKML review and
>>>>> more intensive testing.
>>>>>
>>>>
>>>> I spent some time on this today. It didn't go as I expected. I
>>>> validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
>>>> through 4. However, the version of stress-ng that I was using ran
>>>> into constant errors starting with rc5, making it nearly impossible
>>>> to make progress toward reproduction. Upgrading stress-ng to tip
>>>> fixes the issue, however, I've still been unable to repro the issue.
>>>>
>>>> Its my unfounded suspicion that something went in between rc4 and
>>>> rc5 which changed the timing, and didn't actually fix the issue. I
>>>> will run the test overnight for 5 hours to try to repro.
>>>>
>>>> The patch you sent appears to be based on linux-next, and appears to
>>>> have a number of dependencies which prevent it from cleanly applying
>>>> on anything current that I'm able to repro on at this time. Do you
>>>> want to provide a rebased version of the patch which applies to say
>>>> 4.11? I could easily test that and report back.
>>>
>>> Here is a very lightly tested backport to v4.11.
>>>
>>
>> Works for me. Always reproduced the lockup within 2 minutes on stock
>> 4.11. With the change applied, I was able to test for 2 hours in
>> the same conditions, and 4 hours with the full system and not
>> encounter an issue.
>>
>> Feel free to add:
>> Tested-by: Jeffrey Hugo <[email protected]>
>
> Applied, thank you!
>
>> I'm going to go back to 4.12-rc5 and see if I can get either repro
>> the issue, or identify what changed. Hopefully I can get to
>> linux-next and double check the original version of the change as
>> well.
>
> Looking forward to hearing what you find!
>
> Thanx, Paul
>
According to git bisect, the following is what "changed":
commit 9d0eb4624601ac978b9e89be4aeadbd51ab2c830
Merge: 5faab9e 9bc1f09
Author: Linus Torvalds <[email protected]>
Date: Sun Jun 11 11:07:25 2017 -0700
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Bug fixes (ARM, s390, x86)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: async_pf: avoid async pf injection when in guest mode
KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid
emulation
arm: KVM: Allow unaligned accesses at HYP
arm64: KVM: Allow unaligned accesses at EL2
arm64: KVM: Preserve RES1 bits in SCTLR_EL2
KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
KVM: nVMX: Fix exception injection
kvm: async_pf: fix rcu_irq_enter() with irqs enabled
KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
KVM: s390: fix ais handling vs cpu model
KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
Nothing really stands out to me which would "fix" the issue.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
On Sun, Aug 20, 2017 at 01:31:01PM -0600, Jeffrey Hugo wrote:
> On 6/29/2017 6:18 PM, Paul E. McKenney wrote:
> >On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
> >>On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
> >>>On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
> >>>>On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
> >>>>>On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
> >>>>>>No worries, and I am very much looking forward to seeing the results of
> >>>>>>your testing.
> >>>>>
> >>>>>And please see below for an updated patch based on LKML review and
> >>>>>more intensive testing.
> >>>>>
> >>>>
> >>>>I spent some time on this today. It didn't go as I expected. I
> >>>>validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
> >>>>through 4. However, the version of stress-ng that I was using ran
> >>>>into constant errors starting with rc5, making it nearly impossible
> >>>>to make progress toward reproduction. Upgrading stress-ng to tip
> >>>>fixes the issue, however, I've still been unable to repro the issue.
> >>>>
> >>>>Its my unfounded suspicion that something went in between rc4 and
> >>>>rc5 which changed the timing, and didn't actually fix the issue. I
> >>>>will run the test overnight for 5 hours to try to repro.
> >>>>
> >>>>The patch you sent appears to be based on linux-next, and appears to
> >>>>have a number of dependencies which prevent it from cleanly applying
> >>>>on anything current that I'm able to repro on at this time. Do you
> >>>>want to provide a rebased version of the patch which applies to say
> >>>>4.11? I could easily test that and report back.
> >>>
> >>>Here is a very lightly tested backport to v4.11.
> >>>
> >>
> >>Works for me. Always reproduced the lockup within 2 minutes on stock
> >>4.11. With the change applied, I was able to test for 2 hours in
> >>the same conditions, and 4 hours with the full system and not
> >>encounter an issue.
> >>
> >>Feel free to add:
> >>Tested-by: Jeffrey Hugo <[email protected]>
> >
> >Applied, thank you!
> >
> >>I'm going to go back to 4.12-rc5 and see if I can get either repro
> >>the issue, or identify what changed. Hopefully I can get to
> >>linux-next and double check the original version of the change as
> >>well.
> >
> >Looking forward to hearing what you find!
> >
> > Thanx, Paul
> >
>
> According to git bisect, the following is what "changed"
>
> commit 9d0eb4624601ac978b9e89be4aeadbd51ab2c830
> Merge: 5faab9e 9bc1f09
> Author: Linus Torvalds <[email protected]>
> Date: Sun Jun 11 11:07:25 2017 -0700
>
> Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
>
> Pull KVM fixes from Paolo Bonzini:
> "Bug fixes (ARM, s390, x86)"
>
> * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
> KVM: async_pf: avoid async pf injection when in guest mode
> KVM: cpuid: Fix read/write out-of-bounds vulnerability in
> cpuid emulation
> arm: KVM: Allow unaligned accesses at HYP
> arm64: KVM: Allow unaligned accesses at EL2
> arm64: KVM: Preserve RES1 bits in SCTLR_EL2
> KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
> KVM: nVMX: Fix exception injection
> kvm: async_pf: fix rcu_irq_enter() with irqs enabled
> KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
> KVM: s390: fix ais handling vs cpu model
> KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
>
> Nothing really stands out to me which would "fix" the issue.
My guess would be an undo of the change that provoked the problem
in the first place. Did you try bisecting within the above group
of commits?
Either way, CCing Paolo for his thoughts?
Thanx, Paul
On 20/08/2017 22:56, Paul E. McKenney wrote:
>> KVM: async_pf: avoid async pf injection when in guest mode
>> KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
>> arm: KVM: Allow unaligned accesses at HYP
>> arm64: KVM: Allow unaligned accesses at EL2
>> arm64: KVM: Preserve RES1 bits in SCTLR_EL2
>> KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
>> KVM: nVMX: Fix exception injection
>> kvm: async_pf: fix rcu_irq_enter() with irqs enabled
>> KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
>> KVM: s390: fix ais handling vs cpu model
>> KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
>>
>> Nothing really stands out to me which would "fix" the issue.
>
> My guess would be an undo of the change that provoked the problem
> in the first place. Did you try bisecting within the above group
> of commits?
>
> Either way, CCing Paolo for his thoughts?
There is "kvm: async_pf: fix rcu_irq_enter() with irqs enabled", but it
would have caused splats, not deadlocks.
If you are using nested virtualization, "KVM: async_pf: avoid async pf
injection when in guest mode" can be a wildcard, but only if you have
memory pressure.
My bet is still on the former changing the timing just a little bit.
Paolo
On 8/22/2017 10:12 AM, Paolo Bonzini wrote:
> On 20/08/2017 22:56, Paul E. McKenney wrote:
>>> KVM: async_pf: avoid async pf injection when in guest mode
>>> KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
>>> arm: KVM: Allow unaligned accesses at HYP
>>> arm64: KVM: Allow unaligned accesses at EL2
>>> arm64: KVM: Preserve RES1 bits in SCTLR_EL2
>>> KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
>>> KVM: nVMX: Fix exception injection
>>> kvm: async_pf: fix rcu_irq_enter() with irqs enabled
>>> KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
>>> KVM: s390: fix ais handling vs cpu model
>>> KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
>>>
>>> Nothing really stands out to me which would "fix" the issue.
>>
>> My guess would be an undo of the change that provoked the problem
>> in the first place. Did you try bisecting within the above group
>> of commits?
>>
>> Either way, CCing Paolo for his thoughts?
>
> There is "kvm: async_pf: fix rcu_irq_enter() with irqs enabled", but it
> would have caused splats, not deadlocks.
>
> If you are using nested virtualization, "KVM: async_pf: avoid async pf
> injection when in guest mode" can be a wildcard, but only if you have
> memory pressure.
>
> My bet is still on the former changing the timing just a little bit.
>
> Paolo
>
I'm sorry, I must have done the bisect incorrectly.
I attempted to bisect the KVM changes from the merge, but saw that the
issue didn't repro with any of them. I double-checked the merge commit
and confirmed it did not introduce a "fix".
I redid the bisect, and it identified the following change this time. I
double checked that reverting the change reintroduces the deadlock, and
cherry-picking the change onto 4.12-rc4 (known to exhibit the issue)
causes the issue to disappear. I'm pretty sure (knock on wood) that the
bisect result is actually correct this time.
commit 6460495709aeb651896bc8e5c134b2e4ca7d34a8
Author: James Wang <[email protected]>
Date: Thu Jun 8 14:52:51 2017 +0800
Fix loop device flush before configure v3
While installing SLES-12 (based on v4.4), I found that the installer
will stall for 60+ seconds during LVM disk scan. The root cause was
determined to be the removal of a bound device check in loop_flush()
by commit b5dd2f6047ca ("block: loop: improve performance via blk-mq").
Restoring this check, examining ->lo_state as set by loop_set_fd()
eliminates the bad behavior.
Test method:
modprobe loop max_loop=64
dd if=/dev/zero of=disk bs=512 count=200K
for((i=0;i<4;i++))do losetup -f disk; done
mkfs.ext4 -F /dev/loop0
for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done
for f in `ls /dev/loop[0-9]*|sort`; do \
echo $f; dd if=$f of=/dev/null bs=512 count=1; \
done
Test output: stock patched
/dev/loop0 18.1217e-05 8.3842e-05
/dev/loop1 6.1114e-05 0.000147979
/dev/loop10 0.414701 0.000116564
/dev/loop11 0.7474 6.7942e-05
/dev/loop12 0.747986 8.9082e-05
/dev/loop13 0.746532 7.4799e-05
/dev/loop14 0.480041 9.3926e-05
/dev/loop15 1.26453 7.2522e-05
Note that from loop10 onward, the device is not mounted, yet the
stock kernel consumes several orders of magnitude more wall time
than it does for a mounted device.
(Thanks for Mike Galbraith <[email protected]>, give a changelog review.)
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: James Wang <[email protected]>
Fixes: b5dd2f6047ca ("block: loop: improve performance via blk-mq")
Signed-off-by: Jens Axboe <[email protected]>
Considering the original analysis of the issue, it seems plausible that
this change could be fixing it.
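The diff of that commit is not quoted above; from the changelog, the
restored check amounts to an early return in loop_flush() when the
device was never bound. A rough reconstruction (my sketch based on the
changelog, not the actual diff) in drivers/block/loop.c:
static void loop_flush(struct loop_device *lo)
{
	/* loop not yet configured, no running thread, nothing to flush */
	if (lo->lo_state != Lo_bound)
		return;
	blk_mq_freeze_queue(lo->lo_queue);
	blk_mq_unfreeze_queue(lo->lo_queue);
}
If that is right, releasing a loop device that was never configured no
longer reaches blk_mq_freeze_queue() at all, which would explain why the
deadlock stops reproducing with this change applied.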
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.