Subject: Re: rcu_sched stall while waiting in csd_lock_wait()
To: Will Deacon
Cc: linux-arm-kernel, open list, Peter Zijlstra, "Paul E. McKenney",
 marc.zyngier@arm.com, mark.rutland@arm.com
References: <20170802080827.GA15219@arm.com>
From: Pratyush Anand
Date: Thu, 3 Aug 2017 09:25:40 +0530
In-Reply-To: <20170802080827.GA15219@arm.com>

Hi Will,

On Wednesday 02 August 2017 01:38 PM, Will Deacon wrote:
> Hi Pratyush,
>
> On Wed, Aug 02, 2017 at 09:01:19AM +0530, Pratyush Anand wrote:
>> I am observing following rcu_sched stall while executing `perf record -a --
>> sleep 1` with one of the arm64 platform. It looks like that stalled cpu was
>> waiting in csd_lock_wait() from where it never came out, and so the stall.
>> Any help/pointer for further debugging would be very helpful. Problem also
>> reproduced with 4.13.0-rc3.
>
> When you say "also", which other kernel(s) show the problem? Is this a
> recent regression? Which platform are you running on?

Other than 4.13.0-rc3, I had tested with a 4.11-based kernel (vanilla 4.11
plus some 4.12 patches backported); the log I attached earlier was from that
kernel, which is why I mentioned that it reproduces with vanilla upstream as
well.

> It would be interesting to know what the other CPUs are doing, in particular
> the target of the cross-call. Either it crashed spectacularly and didn't
> unlock the csd lock, or the IPI somehow wasn't delivered.
>
> Do you see any other splats if you enable lock debugging?

The output was the same with lock debugging enabled; no additional splats.
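For reference, my reading of the path we are stuck on is roughly the
following (a simplified sketch of the cross-call code in kernel/smp.c,
trimmed for illustration rather than the exact 4.13 source): the caller
queues its csd on the target CPU, kicks it with an IPI, and then spins until
the target clears CSD_FLAG_LOCK. So if the IPI is never delivered, or the
target never runs the callback, the caller spins forever:

/* Simplified sketch of the synchronous cross-call path in kernel/smp.c;
 * illustration only, not the exact 4.13 source. */
static void csd_lock_wait(struct call_single_data *csd)
{
	/* Spin until the target CPU has run the callback and cleared the lock flag. */
	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
}

int smp_call_function_single(int cpu, smp_call_func_t func, void *info, int wait)
{
	struct call_single_data csd = {
		.flags = CSD_FLAG_LOCK,
		.func  = func,
		.info  = info,
	};

	/* Queue the csd on the target CPU's call_single_queue and, if the
	 * queue was previously empty, send an IPI so the target drains it. */
	if (llist_add(&csd.llist, &per_cpu(call_single_queue, cpu)))
		arch_send_call_function_single_ipi(cpu);

	/* With wait == 1 we do not return until the target acknowledges. */
	if (wait)
		csd_lock_wait(&csd);	/* <-- CPU 9 never gets past this */

	return 0;
}

The perf path in the trace below (event_function_call() ->
cpu_function_call()) passes wait = 1, and the spin runs with interrupts
enabled, which is why the tick still lands via el1_irq inside
smp_call_function_single() and we get an rcu_sched stall report rather than
a hard lockup.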
Following is the log from 4.13.0-rc3 + patch pointed by Peter:

[  173.649589] perf: interrupt took too long (4952 > 4902), lowering kernel.perf_event_max_sample_rate to 40300
[  201.340926] INFO: rcu_sched self-detected stall on CPU
[  201.345115]  9-...: (6499 ticks this GP) idle=e1a/140000000000001/0 softirq=334/334 fqs=3250
[  201.353617]   (t=6500 jiffies g=313 c=312 q=428)
[  201.358220] Task dump for CPU 9:
[  201.361431] perf            R  running task        0  1888   1864 0x00000202
[  201.368462] Call trace:
[  201.370897] [] dump_backtrace+0x0/0x28c
[  201.376276] [] show_stack+0x24/0x2c
[  201.381312] [] sched_show_task+0x19c/0x26c
[  201.386952] [] dump_cpu_task+0x48/0x54
[  201.392250] [] rcu_dump_cpu_stacks+0xac/0xf4
[  201.398063] [] rcu_check_callbacks+0x908/0xc90
[  201.404053] [] update_process_times+0x34/0x5c
[  201.409957] [] tick_sched_handle.isra.16+0x4c/0x70
[  201.416292] [] tick_sched_timer+0x48/0x88
[  201.421847] [] __hrtimer_run_queues+0x17c/0x604
[  201.427924] [] hrtimer_interrupt+0xa4/0x1e8
[  201.433656] [] arch_timer_handler_phys+0x3c/0x48
[  201.439818] [] handle_percpu_devid_irq+0xdc/0x42c
[  201.446069] [] generic_handle_irq+0x34/0x4c
[  201.451796] [] __handle_domain_irq+0x6c/0xc4
[  201.457611] [] gic_handle_irq+0xa0/0x1b0
[  201.463080] Exception stack(0xffff8016df013a40 to 0xffff8016df013b70)
[  201.469504] 3a40: 0000000000000000 0000000000000003 0000000000000000 ffff8016df013bd0
[  201.477316] 3a60: ffff8016df013bd0 0000000000080000 ffff8016df013bb8 ffff0000082113c8
[  201.485129] 3a80: 0000000000000000 0000ffffe507a9f0 00000000ffffffff 0000000000000000
[  201.492941] 3aa0: 0000000000000005 ffffffffffffffff 002f547d23157399 00003a2a9f82ac9c
[  201.500754] 3ac0: 0000000000000000 0000000000000000 0000ffffe507a7e0 ffff000008f5b000
[  201.508566] 3ae0: ffff8016df013c08 ffff000008213fcc 0000000000000013 ffff8017616a7800
[  201.516379] 3b00: ffff000008f5b000 ffff0000082179d4 0000000000000000 ffff0000088c1000
[  201.524191] 3b20: ffff8017616a7800 ffff8016df013b70 ffff00000818bf04 ffff8016df013b70
[  201.532004] 3b40: ffff00000818bf28 0000000020400145 ffff000008213fcc 0000000000000013
[  201.539816] 3b60: 0001000000000000 ffff8016df013bb8
[  201.544677] [] el1_irq+0xb8/0x140
[  201.549539] [] smp_call_function_single+0x160/0x184
[  201.555965] [] cpu_function_call+0x40/0x64
[  201.561605] [] event_function_call+0x120/0x128
[  201.567594] [] _perf_event_disable+0x44/0x64
[  201.573410] [] perf_event_for_each_child+0x3c/0x84
[  201.579747] [] perf_ioctl+0x21c/0x9a4
[  201.584957] [] do_vfs_ioctl+0xcc/0x874
[  201.590250] [] sys_ioctl+0x90/0xa4
[  201.595198] [] __sys_trace_return+0x0/0x4
[  239.003035] INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... } 6592 jiffies s: 1149 root: 0x1/.
[  239.012199] blocking rcu_node structures: l=1:0-14:0x200/.
[  239.017695] Task dump for CPU 9:
[  239.020880] perf            R  running task        0  1888   1864 0x00000202
[  239.027929] Call trace:
[  239.030346] [] __switch_to+0x64/0x70
[  239.035484] [] free_pcppages_bulk+0x43c/0x640
[  262.304244] perf: interrupt took too long (6221 > 6190), lowering kernel.perf_event_max_sample_rate to 32100
[  367.009704] INFO: task kworker/15:2:1187 blocked for more than 120 seconds.
[  367.015713]       Tainted: G        W       4.13.0-rc3+ #2
[  367.021200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  367.028994] kworker/15:2    D    0  1187      2 0x00000000
[  367.034481] Workqueue: usb_hub_wq hub_event
[  367.038629] Call trace:
[  367.041077] [] __switch_to+0x64/0x70
[  367.046184] [] __schedule+0x410/0xcbc
[  367.051403] [] schedule+0x40/0xa4
[  367.056253] [] _synchronize_rcu_expedited.constprop.64+0x370/0x3e8
[  367.063990] [] synchronize_sched_expedited+0x7c/0xf0
[  367.070501] [] synchronize_rcu_expedited+0x14/0x1c
[  367.076825] [] bdi_unregister+0x78/0x200
[  367.082307] [] del_gendisk+0x130/0x240
[  367.087592] [] sr_remove+0x30/0x90 [sr_mod]
[  367.093333] [] device_release_driver_internal+0x15c/0x1dc
[  367.100275] [] device_release_driver+0x28/0x34
[  367.106251] [] bus_remove_device+0xe4/0x14c
[  367.111993] [] device_del+0x1dc/0x308
[  367.117191] [] __scsi_remove_device+0xfc/0x128
[  367.123192] [] scsi_forget_host+0x74/0x78
[  367.128778] [] scsi_remove_host+0x7c/0x120
[  367.134396] [] usb_stor_disconnect+0x58/0xc8 [usb_storage]
[  367.141421] [] usb_unbind_interface+0x7c/0x268
[  367.147397] [] device_release_driver_internal+0x15c/0x1dc
[  367.154355] [] device_release_driver+0x28/0x34
[  367.160344] [] bus_remove_device+0xe4/0x14c
[  367.166060] [] device_del+0x1dc/0x308
[  367.171282] [] usb_disable_device+0xc4/0x2ac
[  367.177084] [] usb_disconnect+0x98/0x248
[  367.182566] [] hub_quiesce+0x64/0xac
[  367.187674] [] hub_event+0x110/0xa8c
[  367.192810] [] process_one_work+0x228/0x6a8
[  367.198568] [] worker_thread+0x60/0x3bc
[  367.203922] [] kthread+0x114/0x140
[  367.208898] [] ret_from_fork+0x10/0x40
[  367.214171] INFO: task kworker/15:3:1890 blocked for more than 120 seconds.
[  367.221108]       Tainted: G        W       4.13.0-rc3+ #2
[  367.226563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  367.234388] kworker/15:3    D    0  1890      2 0x00000200
[  367.239918] Workqueue: events wait_rcu_exp_gp
[  367.244184] Call trace:
[  367.246615] [] __switch_to+0x64/0x70
[  367.251751] [] __schedule+0x410/0xcbc
[  367.256945] [] schedule+0x40/0xa4
[  367.261820] [] schedule_timeout+0x1e8/0x454
[  367.267536] [] rcu_exp_wait_wake+0xb8/0x5b0
[  367.273278] [] wait_rcu_exp_gp+0x3c/0x4c
[  367.278776] [] process_one_work+0x228/0x6a8
[  367.284475] [] worker_thread+0x60/0x3bc
[  367.289931] [] kthread+0x114/0x140
[  367.294793] [] ret_from_fork+0x10/0x40
[  367.300100] INFO: lockdep is turned off.

--
Regards
Pratyush