From: Liuyongan <liuyongan@huawei.com>
To: "Wangweidong (Dan)" <wangweidong1@huawei.com>,
        "tglx@linutronix.de" <tglx@linutronix.de>,
        "mingo@redhat.com" <mingo@redhat.com>, "hpa@zytor.com" <hpa@zytor.com>,
        "x86@kernel.org" <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>
CC: Fengtiantian <fengtiantian@huawei.com>,
        "Wulizhen (Pss)" <pss.wulizhen@huawei.com>,
        Xiexiangyou <xiexiangyou@huawei.com>,
        "Herongguang (Stephen)" <herongguang.he@huawei.com>,
        "mtosatti@redhat.com" <mtosatti@redhat.com>,
        "Guozhibin (Hahaer)" <hahaer.guo@huawei.com>
Subject: RE: [Ask for help] met a deadlock with switch_fpu_finish on suse
 3.0.93-0.8-default kernel
Date: Wed, 16 Mar 2016 06:31:02 +0000
Message-ID: <E4ABEE53CC34664FA3F0BD8AEAF50A198F8EBA96@SZXEMA512-MBS.china.huawei.com>
References: <56E80D21.7010607@huawei.com>
In-Reply-To: <56E80D21.7010607@huawei.com>
Accept-Language: zh-CN, en-US
Content-Language: zh-CN
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Content-Transfer-Encoding: 8bit
Content-Length: 4898
Lines: 116

> -----Original Message-----
> From: Wangweidong (Dan)
> Sent: Tuesday, March 15, 2016 9:25 PM
> To: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> linux-kernel@vger.kernel.org; torvalds@linux-foundation.org
> Cc: Fengtiantian; Liuyongan; Wangweidong (Dan)
> Subject: [Ask for help] met a deadlock with switch_fpu_finish on suse
> 3.0.93-0.8-default kernel
> 
> Hi all,
> 
> We have hit a deadlock on the SUSE 3.0.93-0.8-default kernel when
> restore_fpu_checking() returns an error during a task switch.
> --------------------------------------------
> The call trace is:
> PID: 2415   TASK: ffff880b739d24c0  CPU: 5   COMMAND: "qemu-kvm"
>  #0 [ffff880c7f6a6e40] crash_nmi_callback at ffffffff8102460f
>  #1 [ffff880c7f6a6e50] notifier_call_chain at ffffffff81465027
>  #2 [ffff880c7f6a6e80] __atomic_notifier_call_chain at ffffffff8146506d
>  #3 [ffff880c7f6a6e90] notify_die at ffffffff814650bd
>  #4 [ffff880c7f6a6ec0] default_do_nmi at ffffffff81462507
>  #5 [ffff880c7f6a6ee0] do_nmi at ffffffff81462738
>  #6 [ffff880c7f6a6ef0] restart_nmi at ffffffff81461c91
>     [exception RIP: _raw_spin_lock+21]
>     RIP: ffffffff814611e5  RSP: ffff8809d8d1ba80  RFLAGS: 00000093
>     RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000093
>     RDX: ffff8809d8d1ba80  RSI: 0000000000000018  RDI: 0000000000000001
>     RBP: ffffffff814611e5   R8: ffffffff814611e5   R9: 0000000000000018
>     R10: ffff8809d8d1ba80  R11: 0000000000000093  R12: ffffffffffffffff
>     R13: ffff880c7f6b0a00  R14: 0000000000000005  R15: 000000000000e2b8
>     ORIG_RAX: 000000000000e2b8  CS: 0010  SS: 0018
> --- <DOUBLEFAULT exception stack> ---
>  #7 [ffff8809d8d1ba80] _raw_spin_lock at ffffffff814611e5
>  #8 [ffff8809d8d1ba80] try_to_wake_up at ffffffff81054afb
>  #9 [ffff8809d8d1bad0] pollwake at ffffffff8116cfc6
> #10 [ffff8809d8d1bb10] __wake_up_common at ffffffff81046e1a
> #11 [ffff8809d8d1bb50] __wake_up at ffffffff8104bf43
> #12 [ffff8809d8d1bb90] __send_signal at ffffffff81074bfd
> #13 [ffff8809d8d1bbd0] force_sig_info at ffffffff81076194
> #14 [ffff8809d8d1bc00] __switch_to at ffffffff81001930
> #15 [ffff8809d8d1bcf0] reschedule_interrupt at ffffffff8146a06e
> #16 [ffff8809d8d1bd58] vmx_handle_external_intr at ffffffffa03c3f4c [kvm_intel]
> #17 [ffff8809d8d1bd80] vcpu_enter_guest at ffffffffa0363487 [kvm]
> #18 [ffff8809d8d1be00] __vcpu_run at ffffffffa0363743 [kvm]
> #19 [ffff8809d8d1be40] kvm_arch_vcpu_ioctl_run at ffffffffa0364438 [kvm]
> #20 [ffff8809d8d1be70] kvm_vcpu_ioctl at ffffffffa0350cee [kvm]
> #21 [ffff8809d8d1bf10] do_vfs_ioctl at ffffffff8116bd1b
> #22 [ffff8809d8d1bf40] sys_ioctl at ffffffff8116c0e1
> #23 [ffff8809d8d1bf80] system_call_fastpath at ffffffff81469172
> --------------------------------------------
> 
> We found this patch:
> commit 80ab6f1e8c981b1b6604b2f22e36c917526235cd
> "i387: use 'restore_fpu_checking()' directly in task switching code"
> 
> It removes the __math_state_restore() call from switch_fpu_finish(), like this:
> 
>  static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
>  {
> -       if (fpu.preload)
> -               __math_state_restore(new);
> +       if (fpu.preload) {
> +               if (unlikely(restore_fpu_checking(new)))
> +                       __thread_fpu_end(new);
> +       }
>  }
> 
> So with this patch, when restore_fpu_checking() fails in switch_fpu_finish(),
> force_sig() is no longer called.
> 
> 
> 1. Would this patch fix the issue (the deadlock)?
> 2. We don't understand why restore_fpu_checking() would fail. Does anyone
> know?
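
For reference, the behavioural difference between the two versions of
switch_fpu_finish() quoted above can be sketched as a small user-space model.
This is only an illustration under assumptions: struct task_model and all
function names below are hypothetical, and the "before" path assumes that
__math_state_restore() calls force_sig(SIGSEGV) when the restore fails, which
matches the force_sig_info frame under __switch_to in the trace.

```c
#include <stdbool.h>

/* Toy model only; field and function names are illustrative, not the kernel's. */
struct task_model {
    bool owns_fpu;      /* stands in for the task's "has FPU" state */
    int  sigsegv_sent;  /* counts modelled force_sig(SIGSEGV) calls */
};

/* Before the patch (assumed): a failed restore drops the FPU and signals. */
static void finish_before_patch(struct task_model *t, bool preload, bool restore_failed)
{
    if (preload && restore_failed) {
        t->owns_fpu = false;   /* models __thread_fpu_end() */
        t->sigsegv_sent++;     /* models force_sig(), issued from the switch path */
    }
}

/* After the patch: a failed restore only drops the FPU, no signal is sent. */
static void finish_after_patch(struct task_model *t, bool preload, bool restore_failed)
{
    if (preload && restore_failed)
        t->owns_fpu = false;   /* models __thread_fpu_end() */
}

/* Number of signals sent from the context switch when a preloaded restore fails. */
static int signals_on_failed_restore(bool patched)
{
    struct task_model t = { .owns_fpu = true, .sigsegv_sent = 0 };

    if (patched)
        finish_after_patch(&t, true, true);
    else
        finish_before_patch(&t, true, true);
    return t.sigsegv_sent;
}
```

In this model the patched path never enters the signal-delivery code from the
context switch, which is the force_sig_info -> __send_signal -> __wake_up ->
try_to_wake_up path seen spinning in the trace. The model does not say what
happens to the task afterwards; it only shows that no signal is raised.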

Here is a patch whose changelog describes one way the FPU restore could fail.
Does anybody know of any other possible causes?

commit 42bdf991f4cad9678ee2b98c5c2e9299a3f986ef
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Mon Apr 15 23:30:13 2013 -0300

    KVM: x86: fix maintenance of guest/host xcr0 state

    Emulation of xcr0 writes zero guest_xcr0_loaded variable so that
    subsequent VM-entry reloads CPU's xcr0 with guests xcr0 value.

    However, this is incorrect because guest_xcr0_loaded variable is
    read to decide whether to reload hosts xcr0.

    In case the vcpu thread is scheduled out after the guest_xcr0_loaded = 0
    assignment, and scheduler decides to preload FPU:

    switch_to
    {
      __switch_to
        __math_state_restore
          restore_fpu_checking
            fpu_restore_checking
              if (use_xsave())
                  fpu_xrstor_checking
                xrstor64 with CPU's xcr0 == guests xcr0

    Fix by properly restoring hosts xcr0 during emulation of xcr0 writes.
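
To make the changelog above easier to follow, here is a minimal user-space
sketch of the flag handling it describes. It is not kernel code: the structs,
the HOST_XCR0 value, the guest values 0x1 and 0x7, and every function name are
assumptions made for illustration. The restore is modelled as failing whenever
the host state needs a feature bit that is not enabled in the CPU's current
xcr0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy user-space model; none of these names are the kernel's. */
#define HOST_XCR0 0x7ULL /* assumed host mask: x87 | SSE | AVX */

struct cpu_model  { uint64_t xcr0; };
struct vcpu_model { uint64_t guest_xcr0; bool guest_xcr0_loaded; };

/* VM-entry side: load the guest value if it is not already loaded. */
static void load_guest_xcr0(struct cpu_model *cpu, struct vcpu_model *v)
{
    if (!v->guest_xcr0_loaded) {
        cpu->xcr0 = v->guest_xcr0;
        v->guest_xcr0_loaded = true;
    }
}

/* Sched-out side: restore the host value only if the flag says "loaded". */
static void put_guest_xcr0(struct cpu_model *cpu, struct vcpu_model *v)
{
    if (v->guest_xcr0_loaded) {
        cpu->xcr0 = HOST_XCR0;
        v->guest_xcr0_loaded = false;
    }
}

/* Emulated xcr0 write. The buggy variant only clears the flag. */
static void emulate_xcr0_write(struct cpu_model *cpu, struct vcpu_model *v,
                               uint64_t val, bool fixed)
{
    if (fixed)
        put_guest_xcr0(cpu, v);        /* host xcr0 is back in the CPU */
    else
        v->guest_xcr0_loaded = false;  /* CPU still holds the old guest xcr0 */
    v->guest_xcr0 = val;
}

/* Modelled restore of host FPU state: fails if a needed feature is disabled. */
static bool host_xrstor_ok(const struct cpu_model *cpu)
{
    return (HOST_XCR0 & ~cpu->xcr0) == 0;
}

/* True if the host restore succeeds when the vcpu thread is scheduled out
 * right after an emulated xcr0 write. */
static bool sched_out_after_xcr0_write(bool fixed)
{
    struct cpu_model cpu = { .xcr0 = HOST_XCR0 };
    struct vcpu_model v = { .guest_xcr0 = 0x1, .guest_xcr0_loaded = false };

    load_guest_xcr0(&cpu, &v);                /* guest was running with xcr0 = 0x1 */
    assert(cpu.xcr0 == 0x1);
    emulate_xcr0_write(&cpu, &v, 0x7, fixed); /* guest writes a new xcr0 */
    put_guest_xcr0(&cpu, &v);                 /* sched-out path */
    return host_xrstor_ok(&cpu);
}
```

With fixed == false the flag is already clear at sched-out, so the host value
is never written back and the modelled restore runs with the guest's old xcr0
and fails. With fixed == true the host value is restored at the time of the
emulated write, and the restore succeeds.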


> 3. If the patch does fix the problem, we would like to know:
>    if restore_fpu_checking(tsk) really fails and we do not force-send SIGSEGV
>    to the task, would that introduce other issues?
> 
> Regards,
> Weidong
> 
>