Date: Thu, 28 Apr 2011 17:31:10 -0700
From: "Paul E. McKenney"
To: Jim Schutt
Cc: linux-kernel@vger.kernel.org
Subject: Re: mcelog stalls on 2.6.39-rc5
Message-ID: <20110429003110.GC2191@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <4DB87585.3010607@sandia.gov> <20110427230333.GD2135@linux.vnet.ibm.com> <4DB97459.3040301@sandia.gov> <4DB990A1.8090205@sandia.gov>
In-Reply-To: <4DB990A1.8090205@sandia.gov>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Thu, Apr 28, 2011 at 10:06:57AM -0600, Jim Schutt wrote:
> Jim Schutt wrote:
> >Paul E. McKenney wrote:
> >>On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
> >>>Hi,
> >>>
> >>>Testing 2.6.39-rc5 is giving me the following stall:
> >>>
> >>>  [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 (t=60001 jiffies)
> >>>  [ 5767.732001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
> >>>  [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 (t=240032 jiffies)
> >>>  [ 5947.764001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
> >>>  [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
> >>>  [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 seconds.
> >>>  [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>  [ 6121.295397]  ffff880177aefce8 0000000000000082 ffffffff810339b6 ffff880226d35a40
> >>>  [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0 0000000000011e80 ffff880226d35a40
> >>>  [ 6121.310284]  ffff88018226c760 ffff880177aefe80 ffff880177aefd18 ffffffff813af047
> >>>  [ 6121.317725] Call Trace:
> >>>  [ 6121.320176]  [] ? calc_load_account_idle+0xe/0x1d
> >>>  [ 6121.326437]  [] schedule+0x159/0x193
> >>>  [ 6121.331569]  [] schedule_timeout+0x36/0xe2
> >>>  [ 6121.337223]  [] ? trace_hardirqs_on+0x9/0x20
> >>>  [ 6121.343047]  [] do_wait_for_common+0x97/0xe3
> >>>  [ 6121.348967]  [] ? try_to_wake_up+0x200/0x200
> >>>  [ 6121.354794]  [] ? __raw_spin_lock_irq+0x17/0x2f
> >>>  [ 6121.360878]  [] wait_for_common+0x36/0x4d
> >>>  [ 6121.366441]  [] wait_for_completion+0x1d/0x1f
> >>>  [ 6121.372356]  [] synchronize_sched+0x40/0x49
> >>>  [ 6121.378096]  [] ? find_get_pid+0x1b/0x1b
> >>>  [ 6121.383574]  [] mce_read+0x17f/0x25d
> >>>  [ 6121.388707]  [] ? rw_verify_area+0xac/0xdb
> >>>  [ 6121.394358]  [] vfs_read+0xa9/0xe1
> >>>  [ 6121.399317]  [] sys_read+0x4c/0x70
> >>>  [ 6121.404278]  [] system_call_fastpath+0x16/0x1b
> >>>  [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 (t=420064 jiffies)
> >>>  [ 6127.796001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
> >>>  [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 seconds.
> >>>
> >>>Reverting commit a4dd99250dc makes the stalls go away:
> >>>
> >>>    rcu: create new rcu_access_index() and use in mce
> >>>
> >>>    The MCE subsystem needs to sample an RCU-protected index outside of
> >>>    any protection for that index.  If this was a pointer, we would use
> >>>    rcu_access_pointer(), but there is no corresponding rcu_access_index().
> >>>    This commit therefore creates an rcu_access_index() and applies it
> >>>    to MCE.
> >>>
> >>>    Signed-off-by: Paul E. McKenney
> >>>    Tested-by: Zdenek Kabelac
> >>
> >>Wow!
> >>This compiles to just a few instructions, simply wrapping the access
> >>in ACCESS_ONCE().
> >>
> >>Was the original RCU CPU stall repeatable?
> >
> >Yes.  I saw it on two different machines.
> >Both were running as Ceph clients, doing sustained
> >streaming writes, if that helps any.
> >
> >I can attempt to repeat with any extra debugging
> >you'd like me to try.
> 
> Evidently I cannot reproduce this after all, so my revert
> test was incorrect.
> 
> If I can learn how to reproduce this I'll post with details;
> in the meantime ignore this, and sorry for the noise.

OK -- I must confess that I could not understand how this change
could cause the problem!  ;-)

							Thanx, Paul