Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752858AbbF2SLt (ORCPT ); Mon, 29 Jun 2015 14:11:49 -0400 Received: from e31.co.us.ibm.com ([32.97.110.149]:57476 "EHLO e31.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751850AbbF2SLk (ORCPT ); Mon, 29 Jun 2015 14:11:40 -0400 X-Helo: d03dlp03.boulder.ibm.com X-MailFrom: paulmck@linux.vnet.ibm.com X-RcptTo: linux-kernel@vger.kernel.org Date: Mon, 29 Jun 2015 11:11:34 -0700 From: "Paul E. McKenney" To: Mark Rutland Cc: "peterz@infradead.org" , linux-kernel@vger.kernel.org, "umgwanakikbuti@gmail.com" , torvalds@linux-foundation.org, "mingo@kernel.org" , Will Deacon Subject: Re: PRREMPT RCU stalls with 9edfbfed3f544a78 ("sched/core: Rework rq->clock update skips") Message-ID: <20150629181134.GN3717@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20150629173751.GB19863@leverpostej> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150629173751.GB19863@leverpostej> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 15062918-8236-0000-0000-00000CC7C36C Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6948 Lines: 134 On Mon, Jun 29, 2015 at 06:37:52PM +0100, Mark Rutland wrote: > Hi, > > Recently (since ~v4.0) I've been seeing new RCU stall warnings when > running hackbench (and cpu-affine cyclictest instances) on > CONFIG_PREEMPT=y kernels, both arm64 and x86_64: > > arm64 > ----- > INFO: rcu_preempt detected stalls on CPUs/tasks: > Tasks blocked on level-0 rcu_node (CPUs 0-5): So you appear to be blocked on a task that was preempted within its RCU read-side critical section. There is a bug that causes the task not to be printed, which is fixed by 82efed06d5e3 ("rcu: Fix missing task information during rcu-preempt stall"). Not that this will fix the stall itself, but it will at least allow you to see the task that is causing the stall. > (detected by 1, t=2102 jiffies, g=957, c=956, q=23) > All QSes seen, last rcu_preempt kthread activity 2 (4294981808-4294981806), jiffies_till_next_fqs=1, root ->qsmask 0x0 > hackbench R running task 0 10388 9559 0x00000000 > Call trace: > [] dump_backtrace+0x0/0x124 > [] show_stack+0x10/0x1c > [] sched_show_task+0xa4/0x104 > [] rcu_check_callbacks+0x998/0x9a0 > [] update_process_times+0x38/0x6c > [] tick_sched_handle.isra.16+0x1c/0x68 > [] tick_sched_timer+0x40/0x88 > [] __run_hrtimer.isra.34+0x48/0x108 > [] hrtimer_interrupt+0xc4/0x248 > [] arch_timer_handler_phys+0x28/0x38 > [] handle_percpu_devid_irq+0x74/0x9c > [] generic_handle_irq+0x30/0x4c > [] __handle_domain_irq+0x5c/0xac > [] gic_handle_irq+0x30/0x80 > Exception stack(0xffffffc073bc3a80 to 0xffffffc073bc3ba0) > 3a80: 00000064 00000000 00000064 00000000 73bc3bc0 ffffffc0 00331ffc ffffffc0 > 3aa0: 00000064 00000000 00000064 00000000 ffffffff ffffffff 160c3d20 00000000 > 3ac0: e9d018ac 0000007f 00000064 00000000 5a046200 ffffffc9 000000c0 00000000 > 3ae0: 00000000 00000000 00000000 00000000 ffffffff 00000000 00000018 00000000 > 3b00: 00000002 00000000 00000000 00000000 948d9a5c 0000007f 94a1e590 0000007f > 3b20: 0019b7e0 ffffffc0 94a316d0 0000007f e9d01530 0000007f 00000064 00000000 > 3b40: 00000064 00000000 00000064 00000000 00000000 00000000 00000064 00000000 > 3b60: 73bc3d78 ffffffc0 757be080 ffffffc9 6241b980 ffffffc9 00000001 00000000 > 3b80: 55da5500 ffffffc9 73bc3bc0 ffffffc0 00331f9c ffffffc0 73bc3bc0 ffffffc0 > [] el1_irq+0x64/0xd8 > [] skb_copy_datagram_from_iter+0x5c/0x1f0 > [] unix_stream_sendmsg+0xf0/0x334 > [] sock_sendmsg+0x14/0x58 > [] sock_write_iter+0x74/0xd4 > [] __vfs_write+0xac/0x10c > [] vfs_write+0x8c/0x194 > [] SyS_write+0x40/0xa0 > > x86_64 > ------ > [ 2616.307011] INFO: rcu_preempt detected stalls on CPUs/tasks: > [ 2616.311868] Tasks blocked on level-0 rcu_node (CPUs 0-3): Same here, same diagnostic fix. > [ 2616.311868] (detected by 1, t=21014 jiffies, g=10009, c=10008, q=274) > [ 2616.322320] All QSes seen, last rcu_preempt kthread activity 3 (4297283620-4297283617), jiffies_till_next_fqs=3, root ->qsmask 0x0 > [ 2616.322320] hackbench R running task 13984 30174 29054 0x00000008 > [ 2616.322320] 00000000000036a0 ffff8801bec83c98 ffffffff8107b751 0000000000000003 > [ 2616.322320] ffffffff81e3b8c0 ffffffff81f01280 ffffffff81e3b8c0 ffff8801bec83d28 > [ 2616.352044] ffffffff810a90f7 0000000000000000 ffffffff8107faac 000000000000f6a0 > [ 2616.352044] Call Trace: > [ 2616.352044] [] sched_show_task+0xe1/0x150 > [ 2616.352044] [] rcu_check_callbacks+0x8f7/0x900 > [ 2616.352044] [] ? account_system_time+0x9c/0x190 > [ 2616.352044] [] update_process_times+0x34/0x60 > [ 2616.352044] [] tick_sched_handle.isra.15+0x31/0x40 > [ 2616.352044] [] tick_sched_timer+0x3f/0x70 > [ 2616.352044] [] __run_hrtimer+0x7f/0x230 > [ 2616.352044] [] ? tick_sched_handle.isra.15+0x40/0x40 > [ 2616.416043] [] hrtimer_interrupt+0xf3/0x220 > [ 2616.416043] [] hpet_interrupt_handler+0x11/0x30 > [ 2616.416043] [] handle_irq_event_percpu+0x47/0x1d0 > [ 2616.428774] [] handle_irq_event+0x3c/0x60 > [ 2616.428774] [] handle_edge_irq+0x8f/0x130 > [ 2616.428774] [] handle_irq+0xf7/0x180 > [ 2616.450371] [] ? get_parent_ip+0x11/0x50 > [ 2616.450371] [] do_IRQ+0x58/0x100 > [ 2616.450371] [] common_interrupt+0x6b/0x6b > [ 2616.467277] [] ? sock_def_readable+0x67/0x70 > [ 2616.467277] [] ? avc_has_perm+0x105/0x1b0 > [ 2616.467277] [] ? avc_has_perm+0xd7/0x1b0 > [ 2616.482406] [] inode_has_perm.isra.29+0x22/0x30 > [ 2616.482406] [] file_has_perm+0x87/0xa0 > [ 2616.493232] [] ? fsnotify+0x2d6/0x400 > [ 2616.493232] [] selinux_file_permission+0xa5/0x120 > [ 2616.493232] [] security_file_permission+0x1e/0xa0 > [ 2616.493232] [] rw_verify_area+0x51/0xd0 > [ 2616.493232] [] vfs_write+0x6f/0x1b0 > [ 2616.493232] [] SyS_write+0x41/0xb0 > [ 2616.493232] [] ? SyS_poll+0x60/0xf0 > [ 2616.493232] [] system_call_fastpath+0x12/0x6a > > I've bisected to 9edfbfed3f544a78 ("sched/core: Rework rq->clock update > skips"), and with that reverted I no longer see stalls. For v4.1 I also > had to revert 44fb085bfa17628c "sched/deadline: Add rq->clock update > skip for dl task yield" as it calls function introduced in > 9edfbfed3f544a78. > > Any ideas as to what could be going on? I don't see why those commits would prevent a preempted task from running, but then I don't claim to fully understand these changes. > To reproduce the issue I'm running the following in parallel on > dual-core machines: > > $ while true; do ./hackbench 100 process 1000; sleep 30; done > # ./cyclictest -a0 -n -M -m > # ./cyclictest -a1 -n -M -m > > Usually it takes a couple of minutes to trigger, but I've seen it take > up to half an hour. With the reverts I haven't seen stalls after testing > for an hour or so. I would want to see at least a 5-hour run before being all that sure, but an hour is at least promising. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/