Date: Fri, 4 Dec 2015 15:20:22 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: tglx@linutronix.de, peterz@infradead.org, preeti@linux.vnet.ibm.com,
	viresh.kumar@linaro.org, mtosatti@redhat.com, fweisbec@gmail.com
Cc: linux-kernel@vger.kernel.org, sasha.levin@oracle.com
Subject: Possible issue with commit 4961b6e11825?
Message-ID: <20151204232022.GA15891@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com

Hello!

Are there any known issues with commit 4961b6e11825 ("sched: core: Use
hrtimer_start[_expires]()")?

The reason that I ask is that I am about 90% sure that an rcutorture
failure bisects to that commit.  I will be running more tests on
3497d206c4d9 ("perf: core: Use hrtimer_start()"), which is the
predecessor of 4961b6e11825 and which, unlike 4961b6e11825, passes a
12-hour rcutorture test with scenario TREE03.  In contrast, 4961b6e11825
gets 131 RCU CPU stall warnings, 132 reports of one of RCU's
grace-period kthreads being starved, and 525 reports of one of
rcutorture's kthreads being starved.  Most of the test runs hang on
shutdown, which is no surprise if an RCU CPU stall is happening at
about that time.  But perhaps 3497d206c4d9 was just getting lucky,
hence additional testing over the weekend.

Reproducing this takes some doing.  A multisocket x86 box with
significant background computational noise seems able to reproduce it
with high probability in a twelve-hour test.  I -can- make it happen
on a single-socket four-core system (eight hardware threads, again
with significant background computational noise), but I ran the test
for several days before seeing the first error.  In addition, the
probability of hitting this is greatly reduced when running the tests
on the multisocket x86 box without the background computational noise.
(I recently taught some IBMers about ppcmem and herd and gave them
some problems to solve, which is where the background noise came from,
in case you were wondering.  An unexpected benefit of those tools!)

The starvation of RCU's grace-period kthreads is quite surprising, as
diagnostics indicate that they are in a wait_event_interruptible_timeout()
with a three-jiffy timeout.  The starvation is not subtle: 21-second
starvation periods are quite common, and 84-second starvation periods
occur from time to time.  In addition, rcutorture goes idle every few
seconds in order to test ramp-up and ramp-down effects, which should
rule out starvation due to heavy load.  Besides, I never see any
softlockup warnings, which should appear in the heavy-load-starvation
case.

The commit log for 4961b6e11825 is as follows:

    sched: core: Use hrtimer_start[_expires]()

    hrtimer_start() now enforces a timer interrupt when an already
    expired timer is enqueued.

    Get rid of the __hrtimer_start_range_ns() invocations and the
    loops around it.
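If I am reading that correctly, the change boils down to something like
the following before/after pattern (a rough sketch paraphrased from the
commit log, not the actual diff; the function names here are mine):

#include <linux/hrtimer.h>

/* Pre-4961b6e11825 pattern, roughly: open-coded start.  An
 * already-expired timer could be enqueued without forcing a timer
 * interrupt, hence the retry loops around calls like this one. */
static void restart_timer_old(struct hrtimer *timer)
{
	__hrtimer_start_range_ns(timer, hrtimer_get_softexpires(timer),
				 0, HRTIMER_MODE_ABS_PINNED, 0);
}

/* Post-4961b6e11825 pattern, roughly: hrtimer_start_expires() now
 * enforces a timer interrupt when an already-expired timer is
 * enqueued, so no retry loop is needed. */
static void restart_timer_new(struct hrtimer *timer)
{
	hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
}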
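And for reference, the kthread wait that is being starved is of roughly
the following form (again a sketch rather than the actual RCU code;
my_wq and my_wake_condition() are stand-in names):

#include <linux/types.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);

/* Stand-in for the kthread's real wakeup condition. */
static bool my_wake_condition(void)
{
	return false;	/* never signaled, so the timeout governs */
}

/* Returns >0 if the condition became true, 0 on timeout, or
 * -ERESTARTSYS if interrupted by a signal. */
static long my_gp_kthread_wait(void)
{
	/* Three-jiffy timeout: even with no wakeups at all, this
	 * should return within a few jiffies, not after 21 seconds. */
	return wait_event_interruptible_timeout(my_wq,
						my_wake_condition(), 3);
}

The point being that even if every wakeup were lost, the three-jiffy
timeout should bound the wait, which is part of why the tens-of-seconds
starvation periods are so surprising.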
Is it possible that I need to adjust RCU or rcutorture code to account
for these newly enforced timer interrupts?  Or is there a known bug in
this commit whose fix I need to apply when bisecting?  (There were two
other commits whose fixes I needed to apply in this way, so I figured
I should ask.)

							Thanx, Paul