Message-ID: <56E80D21.7010607@huawei.com>
Date: Tue, 15 Mar 2016 21:24:49 +0800
From: Weidong Wang
CC: Fengtiantian
Subject: [Ask for help] met a deadlock with switch_fpu_finish on suse 3.0.93-0.8-default kernel
X-Mailing-List: linux-kernel@vger.kernel.org

Hi all,

We have hit a deadlock on the SUSE 3.0.93-0.8-default kernel when
restore_fpu_checking() returns an error during a task switch.
--------------------------------------------
The call trace is:

PID: 2415   TASK: ffff880b739d24c0  CPU: 5   COMMAND: "qemu-kvm"
 #0 [ffff880c7f6a6e40] crash_nmi_callback at ffffffff8102460f
 #1 [ffff880c7f6a6e50] notifier_call_chain at ffffffff81465027
 #2 [ffff880c7f6a6e80] __atomic_notifier_call_chain at ffffffff8146506d
 #3 [ffff880c7f6a6e90] notify_die at ffffffff814650bd
 #4 [ffff880c7f6a6ec0] default_do_nmi at ffffffff81462507
 #5 [ffff880c7f6a6ee0] do_nmi at ffffffff81462738
 #6 [ffff880c7f6a6ef0] restart_nmi at ffffffff81461c91
    [exception RIP: _raw_spin_lock+21]
    RIP: ffffffff814611e5  RSP: ffff8809d8d1ba80  RFLAGS: 00000093
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000093
    RDX: ffff8809d8d1ba80  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff814611e5   R8: ffffffff814611e5   R9: 0000000000000018
    R10: ffff8809d8d1ba80  R11: 0000000000000093  R12: ffffffffffffffff
    R13: ffff880c7f6b0a00  R14: 0000000000000005  R15: 000000000000e2b8
    ORIG_RAX: 000000000000e2b8  CS: 0010  SS: 0018
--- ---
 #7 [ffff8809d8d1ba80] _raw_spin_lock at ffffffff814611e5
 #8 [ffff8809d8d1ba80] try_to_wake_up at ffffffff81054afb
 #9 [ffff8809d8d1bad0] pollwake at ffffffff8116cfc6
#10 [ffff8809d8d1bb10] __wake_up_common at ffffffff81046e1a
#11 [ffff8809d8d1bb50] __wake_up at ffffffff8104bf43
#12 [ffff8809d8d1bb90] __send_signal at ffffffff81074bfd
#13 [ffff8809d8d1bbd0] force_sig_info at ffffffff81076194
#14 [ffff8809d8d1bc00] __switch_to at ffffffff81001930
#15 [ffff8809d8d1bcf0] reschedule_interrupt at ffffffff8146a06e
#16 [ffff8809d8d1bd58] vmx_handle_external_intr at ffffffffa03c3f4c [kvm_intel]
#17 [ffff8809d8d1bd80] vcpu_enter_guest at ffffffffa0363487 [kvm]
#18 [ffff8809d8d1be00] __vcpu_run at ffffffffa0363743 [kvm]
#19 [ffff8809d8d1be40] kvm_arch_vcpu_ioctl_run at ffffffffa0364438 [kvm]
#20 [ffff8809d8d1be70] kvm_vcpu_ioctl at ffffffffa0350cee [kvm]
#21 [ffff8809d8d1bf10] do_vfs_ioctl at ffffffff8116bd1b
#22 [ffff8809d8d1bf40] sys_ioctl at ffffffff8116c0e1
#23 [ffff8809d8d1bf80] system_call_fastpath at ffffffff81469172
--------------------------------------------

We looked at commit 80ab6f1e8c981b1b6604b2f22e36c917526235cd
"i387: use 'restore_fpu_checking()' directly in task switching code".
This patch removes the __math_state_restore() call from
switch_fpu_finish(), like this:

 static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
 {
-	if (fpu.preload)
-		__math_state_restore(new);
+	if (fpu.preload) {
+		if (unlikely(restore_fpu_checking(new)))
+			__thread_fpu_end(new);
+	}
 }

In our trace, force_sig_info() is called from __switch_to(), and the CPU
then spins in _raw_spin_lock() under try_to_wake_up(). With the patch
applied, switch_fpu_finish() no longer calls force_sig() when
restore_fpu_checking() fails.

1. Would this patch fix the deadlock?

2. We do not understand why restore_fpu_checking() would fail. Does
   anyone know?

3. If the patch does fix the problem: when restore_fpu_checking(tsk)
   really fails and we no longer force-send the SIGSEGV to the task,
   would that introduce other issues?

Regards,
Weidong
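
To illustrate the failure mode we suspect, here is a minimal user-space
sketch in C. It assumes that the lock try_to_wake_up() spins on is one that
is already held on the same CPU during the context switch; the trace alone
does not prove this. All toy_* names are hypothetical and are not kernel
APIs, and a trylock stands in for spin_lock() so that the sketch returns
false where the real kernel would spin forever.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-recursive kernel spinlock. */
struct toy_lock {
	atomic_flag held;
};

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Models the wakeup done while delivering a signal: it needs the lock.
 * Returning false here corresponds to spinning forever in the kernel.
 */
static bool toy_wake_up(struct toy_lock *l)
{
	if (!toy_trylock(l))
		return false;
	toy_unlock(l);
	return true;
}

/*
 * Models a context switch that runs with the lock held (assumption).
 * If the FPU restore fails and we signal from inside the switch, the
 * wakeup path tries to take the same lock again.
 */
static bool toy_context_switch(struct toy_lock *l, bool restore_failed,
			       bool signal_on_failure)
{
	bool ok = true;

	if (!toy_trylock(l))
		return false;
	if (restore_failed && signal_on_failure)
		ok = toy_wake_up(l);	/* re-entrant acquire of a held lock */
	toy_unlock(l);
	return ok;
}
```

With a lock initialised as `struct toy_lock rq = { ATOMIC_FLAG_INIT };`,
toy_context_switch(&rq, true, true) models the old behaviour (signal sent
from inside the switch) and reports the self-deadlock, while
toy_context_switch(&rq, true, false) models the patched behaviour and
completes.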