Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752993AbcJDIK1 (ORCPT ); Tue, 4 Oct 2016 04:10:27 -0400 Received: from mail-wm0-f45.google.com ([74.125.82.45]:36584 "EHLO mail-wm0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752151AbcJDIKZ (ORCPT ); Tue, 4 Oct 2016 04:10:25 -0400 To: Peter Hurley Cc: "Linux-Kernel@Vger. Kernel. Org" From: Nikolay Borisov Subject: more hangs in the tty layer Message-ID: <57F363EC.8040304@kyup.com> Date: Tue, 4 Oct 2016 11:10:20 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5361 Lines: 96 Hello Peter, I've emailed you before re. spurious hangs in the TTY layer, but at that time I was running a rather old (but LTS) 3.12 kernel. Now, I'm running a 4.4.10 and I still observe the following lock-ups. I have multiple processes which hang with the following callstack: [4592064.967445] sulogin D ffff8839bca53b28 0 38279 38274 0x00000006 [4592064.967448] ffff8839bca53b28 ffff881ff2ed44c0 ffff883748d85280 ffff8839bca54000 [4592064.967452] ffff883748d85280 ffff881ff07c988c 00000000ffffffff ffff881ff07c9890 [4592064.967454] ffff8839bca53b40 ffffffff8161020c ffff881ff07c9888 ffff8839bca53b50 [4592064.967458] Call Trace: [4592064.967460] [] schedule+0x3c/0x90 [4592064.967463] [] schedule_preempt_disabled+0x15/0x20 [4592064.967466] [] __mutex_lock_slowpath+0xb0/0x130 [4592064.967468] [] mutex_lock+0x2d/0x40 [4592064.967471] [] tty_lock+0x38/0x90 [4592064.967474] [] tty_release+0x55/0x590 [4592064.967478] [] __fput+0xea/0x1f0 [4592064.967481] [] ____fput+0xe/0x10 [4592064.967485] [] task_work_run+0x7f/0xb0 [4592064.967488] [] do_exit+0x373/0xb70 [4592064.967490] [] ? __mutex_unlock_slowpath+0x32/0x40 [4592064.967493] [] ? mutex_unlock+0x1b/0x20 [4592064.967495] [] ? n_tty_read+0x31b/0x9e0 [4592064.967498] [] do_group_exit+0x43/0xd0 [4592064.967500] [] get_signal+0x2e1/0x710 [4592064.967503] [] do_signal+0x28/0x770 [4592064.967505] [] ? tty_ldisc_deref+0x16/0x20 [4592064.967508] [] ? tty_read+0x9c/0xf0 [4592064.967511] [] ? __vfs_read+0x28/0xd0 [4592064.967514] [] ? security_file_permission+0xa3/0xc0 [4592064.967517] [] exit_to_usermode_loop+0x8a/0xd0 [4592064.967520] [] syscall_return_slowpath+0x80/0xf0 [4592064.967523] [] int_ret_from_sys_call+0x25/0x9f After debugging the memory dump of the server it turned out the following task is holding the mutex on which the sulogin is stuck: PID: 38274 TASK: ffff88395a109b80 CPU: 13 COMMAND: "sh" #0 [ffff8837aba27ac8] __schedule at ffffffff8160fa9a #1 [ffff8837aba27b10] schedule at ffffffff8161020c #2 [ffff8837aba27b28] schedule_timeout at ffffffff81613472 #3 [ffff8837aba27ba8] ldsem_down_write at ffffffff81613cc5 #4 [ffff8837aba27c10] tty_ldisc_lock at ffffffff816142e6 #5 [ffff8837aba27c28] tty_ldisc_hangup at ffffffff8139561a #6 [ffff8837aba27c50] __tty_hangup at ffffffff8138cd03 #7 [ffff8837aba27ca0] disassociate_ctty at ffffffff8138d2b9 #8 [ffff8837aba27cd0] disassociate_ctty at ffffffff8138d5d9 #9 [ffff8837aba27ce0] do_exit at ffffffff81055908 #10 [ffff8837aba27d48] do_group_exit at ffffffff81055d53 #11 [ffff8837aba27d70] get_signal at ffffffff81061bd1 #12 [ffff8837aba27de8] do_signal at ffffffff810053c8 #13 [ffff8837aba27ee0] exit_to_usermode_loop at ffffffff810021da #14 [ffff8837aba27f18] syscall_return_slowpath at ffffffff81002a90 #15 [ffff8837aba27f50] int_ret_from_sys_call at ffffffff816145f4 RIP: 00007f18fde88aba RSP: 00007ffc47a1fa78 RFLAGS: 00000246 RAX: fffffffffffffe00 RBX: 0000561242611d20 RCX: 00007f18fde88aba RDX: 0000000000000000 RSI: 00007ffc47a1fa9c RDI: 00000000ffffffff RBP: 000056124260f948 R8: 0000000000000000 R9: 00007f18fe3a5740 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc47a1fb20 R13: 00007ffc47a1faa0 R14: 0000000000000000 R15: 0000000000000001 ORIG_RAX: 000000000000003d CS: 0033 SS: 002b But I also observed also the following hung-up task, waiting for read access: [4592064.968193] index.pl D ffff881430387ce8 0 47168 1 0x00000004 [4592064.968196] ffff881430387ce8 ffff881ff2ed0dc0 ffff8808a0960000 ffff881430388000 [4592064.968200] ffff881ff07c9828 ffff881ff07c9830 ffff8808a0960000 7fffffffffffffff [4592064.968203] ffff881430387d00 ffffffff8161020c ffffffff00000004 ffff881430387d80 [4592064.968205] Call Trace: [4592064.968208] [] schedule+0x3c/0x90 [4592064.968211] [] schedule_timeout+0x202/0x260 [4592064.968216] [] ? security_capable+0x48/0x60 [4592064.968219] [] ldsem_down_read+0x122/0x240 [4592064.968222] [] ? vt_ioctl+0x6c/0x1460 [4592064.968225] [] tty_ldisc_ref_wait+0x20/0x50 [4592064.968228] [] tty_ioctl+0x417/0xd40 [4592064.968232] [] ? handle_mm_fault+0x19b6/0x19c0 [4592064.968236] [] do_vfs_ioctl+0x30f/0x560 [4592064.968239] [] ? dput+0x192/0x230 [4592064.968242] [] SyS_ioctl+0x79/0x90 [4592064.968245] [] entry_SYSCALL_64_fastpath+0x16/0x6e I checked and all the tasks are working on the same tty_struct. Here is the state of the struct: http://paste.ubuntu.com/23273973/ To me this seems like a lost wakeup event, especially that I've seen patches flying around fixing such issues. Unfortunately I cannot reliably reproduce this and booting a production machine with lockdep would be problematic. Regards, Nikolay