Message-ID: <495F96DA.5010601@sgi.com>
Date: Sat, 03 Jan 2009 08:48:26 -0800
From: Mike Travis
To: Ingo Molnar
CC: Rusty Russell, Linus Torvalds, linux-kernel@vger.kernel.org
Subject: Re: [PULL] cpumask tree
In-Reply-To: <20090103164255.GA20657@elte.hu>

Ingo Molnar wrote:
> * Mike Travis wrote:
>
>> Ingo Molnar wrote:
>>> * Ingo Molnar wrote:
>>>
>>>> i suspect it's:
>>>>
>>>> | commit 2d22bd5e74519854458ad372a89006e65f45e628
>>>> | Author: Mike Travis
>>>> | Date:   Wed Dec 31 18:08:46 2008 -0800
>>>> |
>>>> |     x86: cleanup remaining cpumask_t code in microcode_core.c
>>>>
>>>> as the microcode is loaded during CPU onlining.
>>>
>>> yep, that's the bad one. Should i revert it or do you have a safe fix in
>>> mind?
>>>
>>>     Ingo
>>
>> Probably revert for now. There are a few more following patches that
>> also use 'work_on_cpu' so a better (more global?) fix should be used.
>>
>> Any thought on using a recursive lock for cpu-hotplug-lock? (At least
>> for get_online_cpus()?)
>
> but the problem has nothing to do with self-recursion. Take a look at the
> lockdep warning i posted (also below) - the locks are simply taken in the
> wrong order.
>
> your change adds this cpu_hotplug.lock usage:
>
> [ 43.652000] -> #1 (&cpu_hotplug.lock){--..}:
> [ 43.652000]    [] __lock_acquire+0xf10/0x1360
> [ 43.652000]    [] lock_acquire+0x99/0xd0
> [ 43.652000]    [] __mutex_lock_common+0xaa/0x450
> [ 43.652000]    [] mutex_lock_nested+0x3f/0x50
> [ 43.652000]    [] get_online_cpus+0x3a/0x50
> [ 43.652000]    [] work_on_cpu+0x6c/0xc0
> [ 43.652000]    [] mc_sysdev_add+0x92/0xa0
> [ 43.652000]    [] sysdev_driver_register+0xb0/0x140
> [ 43.652000]    [] microcode_init+0xb2/0x13b
> [ 43.652000]    [] do_one_initcall+0x41/0x180
> [ 43.652000]    [] kernel_init+0x145/0x19d
> [ 43.652000]    [] child_rip+0xa/0x20
> [ 43.652000]    [] 0xffffffffffffffff
>
> which nests it inside sysdev_drivers_lock - which is wrong
> [sysdev_drivers_lock is a pretty lowlevel lock that generally nests inside
> the CPU hotplug lock].
>
> If you want to use work_on_cpu() it should be done on a higher level, so
> that sysdev_drivers_lock is taken after the hotplug lock.
>
> 	Ingo

Ok, thanks, I will look in that direction.

Mike
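To make the inversion concrete, here is a minimal standalone sketch of the two call paths from the lockdep report below, modelled with pthread mutexes in plain userspace C. The function names only mirror the kernel call chains in the traces (microcode_init/mc_sysdev_add and cpu_down/mce_cpu_callback); the bodies are assumptions for illustration, not the real microcode or sysdev code.

/* Userspace model of the AB-BA lock ordering lockdep is reporting.
 * The two mutexes stand in for sysdev_drivers_lock and cpu_hotplug.lock. */
#include <pthread.h>

static pthread_mutex_t sysdev_drivers_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cpu_hotplug_lock    = PTHREAD_MUTEX_INITIALIZER;

/* Path #1: sysdev_driver_register() -> mc_sysdev_add() -> work_on_cpu()
 * -> get_online_cpus(): the hotplug lock is taken INSIDE sysdev_drivers_lock. */
static void microcode_init_path(void)
{
	pthread_mutex_lock(&sysdev_drivers_lock);
	pthread_mutex_lock(&cpu_hotplug_lock);	/* via get_online_cpus() */
	/* ... apply microcode on the target CPU ... */
	pthread_mutex_unlock(&cpu_hotplug_lock);
	pthread_mutex_unlock(&sysdev_drivers_lock);
}

/* Path #0: cpu_down() -> cpu_hotplug_begin() -> mce_cpu_callback() ->
 * sysdev_unregister(): sysdev_drivers_lock is taken INSIDE the hotplug
 * lock -- the opposite order, which is what triggers the warning. */
static void cpu_down_path(void)
{
	pthread_mutex_lock(&cpu_hotplug_lock);
	pthread_mutex_lock(&sysdev_drivers_lock);
	/* ... unregister the per-CPU sysdev ... */
	pthread_mutex_unlock(&sysdev_drivers_lock);
	pthread_mutex_unlock(&cpu_hotplug_lock);
}

int main(void)
{
	/* Run sequentially this is harmless; run concurrently the inverted
	 * ordering can deadlock, which is what lockdep warns about before
	 * it ever happens. */
	microcode_init_path();
	cpu_down_path();
	return 0;
}

Moving the work_on_cpu()/get_online_cpus() call up a level, as suggested above, would have path #1 take the hotplug lock before sysdev_drivers_lock, matching the order in path #0.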
> [ 43.376051] lockdep: fixing up alternatives.
> [ 43.380007] SMP alternatives: switching to UP code
> [ 43.616014] CPU0 attaching NULL sched-domain.
> [ 43.620068] CPU1 attaching NULL sched-domain.
> [ 43.644482] CPU0 attaching NULL sched-domain.
> [ 43.648264]
> [ 43.648265] =======================================================
> [ 43.652000] [ INFO: possible circular locking dependency detected ]
> [ 43.652000] 2.6.28-05081-geeff031-dirty #37
> [ 43.652000] -------------------------------------------------------
> [ 43.652000] S99local/1238 is trying to acquire lock:
> [ 43.652000]  (sysdev_drivers_lock){--..}, at: [] sysdev_unregister+0x1d/0x80
> [ 43.652000]
> [ 43.652000] but task is already holding lock:
> [ 43.652000]  (&cpu_hotplug.lock){--..}, at: [] cpu_hotplug_begin+0x27/0x60
> [ 43.652000]
> [ 43.652000] which lock already depends on the new lock.
> [ 43.652000]
> [ 43.652000]
> [ 43.652000] the existing dependency chain (in reverse order) is:
> [ 43.652000]
> [ 43.652000] -> #1 (&cpu_hotplug.lock){--..}:
> [ 43.652000]    [] __lock_acquire+0xf10/0x1360
> [ 43.652000]    [] lock_acquire+0x99/0xd0
> [ 43.652000]    [] __mutex_lock_common+0xaa/0x450
> [ 43.652000]    [] mutex_lock_nested+0x3f/0x50
> [ 43.652000]    [] get_online_cpus+0x3a/0x50
> [ 43.652000]    [] work_on_cpu+0x6c/0xc0
> [ 43.652000]    [] mc_sysdev_add+0x92/0xa0
> [ 43.652000]    [] sysdev_driver_register+0xb0/0x140
> [ 43.652000]    [] microcode_init+0xb2/0x13b
> [ 43.652000]    [] do_one_initcall+0x41/0x180
> [ 43.652000]    [] kernel_init+0x145/0x19d
> [ 43.652000]    [] child_rip+0xa/0x20
> [ 43.652000]    [] 0xffffffffffffffff
> [ 43.652000]
> [ 43.652000] -> #0 (sysdev_drivers_lock){--..}:
> [ 43.652000]    [] __lock_acquire+0xfec/0x1360
> [ 43.652000]    [] lock_acquire+0x99/0xd0
> [ 43.652000]    [] __mutex_lock_common+0xaa/0x450
> [ 43.652000]    [] mutex_lock_nested+0x3f/0x50
> [ 43.652000]    [] sysdev_unregister+0x1d/0x80
> [ 43.652000]    [] mce_cpu_callback+0xce/0x101
> [ 43.652000]    [] notifier_call_chain+0x65/0xa0
> [ 43.652000]    [] raw_notifier_call_chain+0x16/0x20
> [ 43.652000]    [] _cpu_down+0x240/0x350
> [ 43.652000]    [] cpu_down+0x7b/0xa0
> [ 43.652000]    [] store_online+0x48/0xa0
> [ 43.652000]    [] sysdev_store+0x20/0x30
> [ 43.652000]    [] sysfs_write_file+0xcf/0x140
> [ 43.652000]    [] vfs_write+0xc7/0x150
> [ 43.652000]    [] sys_write+0x55/0x90
> [ 43.652000]    [] system_call_fastpath+0x16/0x1b
> [ 43.652000]    [] 0xffffffffffffffff
> [ 43.652000]
> [ 43.652000] other info that might help us debug this:
> [ 43.652000]
> [ 43.652000] 3 locks held by S99local/1238:
> [ 43.652000]  #0:  (&buffer->mutex){--..}, at: [] sysfs_write_file+0x48/0x140
> [ 43.652000]  #1:  (cpu_add_remove_lock){--..}, at: [] cpu_down+0x2f/0xa0
> [ 43.652000]  #2:  (&cpu_hotplug.lock){--..}, at: [] cpu_hotplug_begin+0x27/0x60
> [ 43.652000]
> [ 43.652000] stack backtrace:
> [ 43.652000] Pid: 1238, comm: S99local Not tainted 2.6.28-05081-geeff031-dirty #37
> [ 43.652000] Call Trace:
> [ 43.652000]  [] print_circular_bug_tail+0xa4/0x100
> [ 43.652000]  [] __lock_acquire+0xfec/0x1360
> [ 43.652000]  [] lock_acquire+0x99/0xd0
> [ 43.652000]  [] ? sysdev_unregister+0x1d/0x80
> [ 43.652000]  [] __mutex_lock_common+0xaa/0x450
> [ 43.652000]  [] ? sysdev_unregister+0x1d/0x80
> [ 43.652000]  [] ? sysdev_unregister+0x1d/0x80
> [ 43.652000]  [] mutex_lock_nested+0x3f/0x50
> [ 43.652000]  [] sysdev_unregister+0x1d/0x80
> [ 43.652000]  [] mce_cpu_callback+0xce/0x101
> [ 43.652000]  [] notifier_call_chain+0x65/0xa0
> [ 43.652000]  [] raw_notifier_call_chain+0x16/0x20
> [ 43.652000]  [] _cpu_down+0x240/0x350
> [ 43.652000]  [] ? wait_for_common+0xe3/0x1b0
> [ 43.652000]  [] cpu_down+0x7b/0xa0
> [ 43.652000]  [] store_online+0x48/0xa0
> [ 43.652000]  [] sysdev_store+0x20/0x30
> [ 43.652000]  [] sysfs_write_file+0xcf/0x140
> [ 43.652000]  [] vfs_write+0xc7/0x150
> [ 43.652000]  [] sys_write+0x55/0x90
> [ 43.652000]  [] system_call_fastpath+0x16/0x1b
> [ 43.652104] device: 'msr1': device_unregister
> [ 43.656005] PM: Removing info for No Bus:msr1