Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753299AbbEAGDw (ORCPT ); Fri, 1 May 2015 02:03:52 -0400 Received: from mail-ie0-f175.google.com ([209.85.223.175]:35749 "EHLO mail-ie0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750742AbbEAGDt (ORCPT ); Fri, 1 May 2015 02:03:49 -0400 MIME-Version: 1.0 In-Reply-To: <20150430162945.GH3488@pd.tnic> References: <1430319893-23399-1-git-send-email-ethan.zhao@oracle.com> <20150430162945.GH3488@pd.tnic> Date: Fri, 1 May 2015 14:03:48 +0800 Message-ID: Subject: Re: [PATCH] x86/mce: fix mce_restart() race with CPU hotplug operation From: Ethan Zhao To: Borislav Petkov Cc: Ethan Zhao , tony.luck@intel.com, Thomas Gleixner , mingo@redhat.com, hpa@zytor.com, x86@kernel.org, linux-edac@vger.kernel.org, LKML , tim.uglow@oracle.com Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11228 Lines: 229 On Fri, May 1, 2015 at 12:29 AM, Borislav Petkov wrote: > On Thu, Apr 30, 2015 at 12:04:53AM +0900, Ethan Zhao wrote: >> while testing CPU hotplug and MCE with following two scripts, >> >> script 1: >> >> for i in {1..30}; do while :; do ((a=$RANDOM%160)); echo 0 >> >> /sys/devices/system/cpu/cpu${i}/online; echo 1 >> >> /sys/devices/system/cpu/cpu${i}/online; done & done >> >> script 2: >> >> while :; do for i in $(ls >> /sys/devices/system/machinecheck/machinecheck*/check_interval); do echo 1 >> >> $i; done; done > > For the record, it is a public secret that CPU hotplug is broken. IOW, > you're wasting your time with those senseless pounder tests but ok. :<, Someone else is stressing the CPU hotplug, seems it is fragile. My job is holding the system, not panic to the ground. Thanks, Ethan > > ... > >> --- >> arch/x86/kernel/cpu/mcheck/mce.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c >> index 3c036cb..fcc2794 100644 >> --- a/arch/x86/kernel/cpu/mcheck/mce.c >> +++ b/arch/x86/kernel/cpu/mcheck/mce.c >> @@ -1338,8 +1338,10 @@ static void mce_timer_delete_all(void) >> { >> int cpu; >> >> + get_online_cpus(); >> for_each_online_cpu(cpu) >> del_timer_sync(&per_cpu(mce_timer, cpu)); >> + put_online_cpus(); >> } >> >> static void mce_do_trigger(struct work_struct *work) >> @@ -2085,7 +2087,9 @@ static void mce_cpu_restart(void *data) >> static void mce_restart(void) >> { >> mce_timer_delete_all(); >> + get_online_cpus(); >> on_each_cpu(mce_cpu_restart, NULL, 1); >> + put_online_cpus(); > > With your patch applied I get on 4.1-rc1+: > > --- > [ 41.364909] kvm: disabling virtualization on CPU1 > [ 41.371083] smpboot: CPU 1 is now offline > [ 41.381190] x86: Booting SMP configuration: > [ 41.385405] smpboot: Booting Node 0 Processor 1 APIC 0x2 > [ 41.402901] kvm: enabling virtualization on CPU1 > [ 41.440944] kvm: disabling virtualization on CPU1 > [ 41.447010] smpboot: CPU 1 is now offline > [ 41.486082] kvm: disabling virtualization on CPU6 > [ 41.491827] smpboot: CPU 6 is now offline > [ 41.497521] smpboot: Booting Node 0 Processor 6 APIC 0x5 > [ 41.514983] kvm: enabling virtualization on CPU6 > [ 41.561643] kvm: disabling virtualization on CPU6 > [ 41.566848] smpboot: CPU 6 is now offline > [ 41.572606] smpboot: Booting Node 0 Processor 6 APIC 0x5 > [ 41.590049] kvm: enabling virtualization on CPU6 > [ 41.636817] kvm: disabling virtualization on CPU6 > [ 41.642575] smpboot: CPU 6 is now offline > [ 41.676812] kvm: disabling virtualization on CPU7 > [ 41.682429] smpboot: CPU 7 is now offline > [ 41.687974] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 41.705416] kvm: enabling virtualization on CPU7 > [ 41.752739] kvm: disabling virtualization on CPU7 > [ 41.758455] smpboot: CPU 7 is now offline > [ 41.764089] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 41.781561] kvm: enabling virtualization on CPU7 > [ 41.831610] kvm: disabling virtualization on CPU7 > [ 41.837280] smpboot: CPU 7 is now offline > > [ 41.843341] ====================================================== > [ 41.849561] [ INFO: possible circular locking dependency detected ] > [ 41.855883] 4.1.0-rc1+ #2 Not tainted > [ 41.859564] ------------------------------------------------------- > [ 41.865871] script2.sh/2071 is trying to acquire lock: > [ 41.871044] (cpu_hotplug.lock){++++++}, at: [] get_online_cpus+0x32/0x80 > [ 41.879521] > but task is already holding lock: > [ 41.885392] (s_active#121){++++.+}, at: [] kernfs_fop_write+0x6e/0x1a0 > [ 41.893695] > which lock already depends on the new lock. > > [ 41.901925] > the existing dependency chain (in reverse order) is: > [ 41.909465] > -> #2 (s_active#121){++++.+}: > [ 41.913739] [] lock_acquire+0xd1/0x2b0 > [ 41.919718] [] __kernfs_remove+0x228/0x300 > [ 41.926046] [] kernfs_remove_by_name_ns+0x49/0xb0 > [ 41.932976] [] sysfs_remove_file_ns+0x15/0x20 > [ 41.939552] [] device_remove_file+0x19/0x20 > [ 41.945968] [] mce_device_remove+0x54/0xd0 > [ 41.952284] [] mce_cpu_callback+0x69/0x120 > [ 41.958608] [] notifier_call_chain+0x66/0x90 > [ 41.965124] [] __raw_notifier_call_chain+0xe/0x10 > [ 41.972053] [] cpu_notify+0x23/0x50 > [ 41.977761] [] cpu_notify_nofail+0xe/0x20 > [ 41.983986] [] _cpu_down+0x1b6/0x2d0 > [ 41.989787] [] cpu_down+0x36/0x50 > [ 41.995324] [] cpu_subsys_offline+0x14/0x20 > [ 42.001734] [] device_offline+0x95/0xc0 > [ 42.007797] [] online_store+0x3d/0x90 > [ 42.013673] [] dev_attr_store+0x18/0x30 > [ 42.019735] [] sysfs_kf_write+0x49/0x60 > [ 42.025796] [] kernfs_fop_write+0x140/0x1a0 > [ 42.032211] [] __vfs_write+0x28/0xf0 > [ 42.038013] [] vfs_write+0xa9/0x1b0 > [ 42.043715] [] SyS_write+0x49/0xb0 > [ 42.049334] [] system_call_fastpath+0x16/0x7a > [ 42.055926] > -> #1 (cpu_hotplug.lock#2){+.+.+.}: > [ 42.060705] [] lock_acquire+0xd1/0x2b0 > [ 42.066674] [] mutex_lock_nested+0x61/0x460 > [ 42.073087] [] cpu_hotplug_begin+0x7b/0xd0 > [ 42.079400] [] _cpu_up+0x2b/0x160 > [ 42.084927] [] cpu_up+0x8a/0xb0 > [ 42.090285] [] smp_init+0x86/0x88 > [ 42.095818] [] kernel_init_freeable+0x160/0x27d > [ 42.102577] [] kernel_init+0xe/0xf0 > [ 42.108455] [] ret_from_fork+0x42/0x70 > [ 42.114586] > -> #0 (cpu_hotplug.lock){++++++}: > [ 42.119497] [] __lock_acquire+0x1db6/0x2040 > [ 42.126062] [] lock_acquire+0xd1/0x2b0 > [ 42.132182] [] get_online_cpus+0x58/0x80 > [ 42.138476] [] mce_timer_delete_all+0x23/0x70 > [ 42.145215] [] mce_restart+0xe/0x30 > [ 42.151071] [] store_int_with_restart+0x1b/0x30 > [ 42.157975] [] dev_attr_store+0x18/0x30 > [ 42.164179] [] sysfs_kf_write+0x49/0x60 > [ 42.170383] [] kernfs_fop_write+0x140/0x1a0 > [ 42.176938] [] __vfs_write+0x28/0xf0 > [ 42.182879] [] vfs_write+0xa9/0x1b0 > [ 42.188733] [] SyS_write+0x49/0xb0 > [ 42.194498] [] system_call_fastpath+0x16/0x7a > [ 42.201230] > other info that might help us debug this: > > [ 42.210268] Chain exists of: > cpu_hotplug.lock --> cpu_hotplug.lock#2 --> s_active#121 > > [ 42.219214] Possible unsafe locking scenario: > > [ 42.225875] CPU0 CPU1 > [ 42.230578] ---- ---- > [ 42.235283] lock(s_active#121); > [ 42.238791] lock(cpu_hotplug.lock#2); > [ 42.245362] lock(s_active#121); > [ 42.251401] lock(cpu_hotplug.lock); > [ 42.255250] > *** DEADLOCK *** > > [ 42.262008] 3 locks held by script2.sh/2071: > [ 42.266446] #0: (sb_writers#6){.+.+.+}, at: [] vfs_write+0x163/0x1b0 > [ 42.274792] #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x66/0x1a0 > [ 42.283485] #2: (s_active#121){++++.+}, at: [] kernfs_fop_write+0x6e/0x1a0 > [ 42.292353] > stack backtrace: > [ 42.297084] CPU: 3 PID: 2071 Comm: script2.sh Not tainted 4.1.0-rc1+ #2 > [ 42.303880] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014 > [ 42.311474] ffffffff82847b60 ffff88043934fb68 ffffffff818a08e0 0000000000000011 > [ 42.319092] ffffffff828479b0 ffff88043934fbb8 ffffffff810c2a9d 000000000000aae0 > [ 42.326712] ffff88043934fc28 0000000000000002 ffff8800a82cc0b8 0000000000000002 > [ 42.334332] Call Trace: > [ 42.336939] [] dump_stack+0x4f/0x7b > [ 42.342251] [] print_circular_bug+0x1cd/0x230 > [ 42.348436] [] __lock_acquire+0x1db6/0x2040 > [ 42.354455] [] ? check_irq_usage+0x89/0xc0 > [ 42.360385] [] lock_acquire+0xd1/0x2b0 > [ 42.365958] [] ? get_online_cpus+0x32/0x80 > [ 42.371880] [] get_online_cpus+0x58/0x80 > [ 42.377626] [] ? get_online_cpus+0x32/0x80 > [ 42.383556] [] ? simple_strtol+0xe/0x20 > [ 42.389215] [] mce_timer_delete_all+0x23/0x70 > [ 42.395401] [] mce_restart+0xe/0x30 > [ 42.400721] [] store_int_with_restart+0x1b/0x30 > [ 42.407080] [] dev_attr_store+0x18/0x30 > [ 42.412739] [] sysfs_kf_write+0x49/0x60 > [ 42.418396] [] kernfs_fop_write+0x140/0x1a0 > [ 42.424405] [] __vfs_write+0x28/0xf0 > [ 42.429803] [] ? vfs_write+0x163/0x1b0 > [ 42.435385] [] ? context_tracking_exit+0x64/0x1c0 > [ 42.441922] [] vfs_write+0xa9/0x1b0 > [ 42.447234] [] SyS_write+0x49/0xb0 > [ 42.452459] [] system_call_fastpath+0x16/0x7a > [ 42.459043] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 42.476980] kvm: enabling virtualization on CPU7 > [ 42.525076] kvm: disabling virtualization on CPU7 > [ 42.530297] smpboot: CPU 7 is now offline > [ 42.535071] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 42.552746] kvm: enabling virtualization on CPU7 > [ 42.600174] kvm: disabling virtualization on CPU7 > [ 42.605374] smpboot: CPU 7 is now offline > [ 42.610194] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 42.627827] kvm: enabling virtualization on CPU7 > [ 42.681186] kvm: disabling virtualization on CPU7 > [ 42.687093] smpboot: CPU 7 is now offline > [ 42.692417] smpboot: Booting Node 0 Processor 7 APIC 0x7 > [ 42.710046] kvm: enabling virtualization on CPU7 > [ 42.751031] kvm: disabling virtualization on CPU7 > [ 42.756130] smpboot: CPU 7 is now offline > > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. > -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/