Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759495AbcKDAi3 (ORCPT ); Thu, 3 Nov 2016 20:38:29 -0400 Received: from mail-ua0-f173.google.com ([209.85.217.173]:33524 "EHLO mail-ua0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752892AbcKDAi1 (ORCPT ); Thu, 3 Nov 2016 20:38:27 -0400 MIME-Version: 1.0 In-Reply-To: <57F363EC.8040304@kyup.com> References: <57F363EC.8040304@kyup.com> From: Peter Hurley Date: Thu, 3 Nov 2016 18:38:25 -0600 Message-ID: Subject: Re: more hangs in the tty layer To: Nikolay Borisov Cc: "Linux-Kernel@Vger. Kernel. Org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5806 Lines: 101 On Tue, Oct 4, 2016 at 2:10 AM, Nikolay Borisov wrote: > Hello Peter, > > I've emailed you before re. spurious hangs in the TTY layer, but > at that time I was running a rather old (but LTS) 3.12 kernel. Now, > I'm running a 4.4.10 and I still observe the following lock-ups. > I have multiple processes which hang with the following callstack: > > [4592064.967445] sulogin D ffff8839bca53b28 0 38279 38274 0x00000006 > [4592064.967448] ffff8839bca53b28 ffff881ff2ed44c0 ffff883748d85280 ffff8839bca54000 > [4592064.967452] ffff883748d85280 ffff881ff07c988c 00000000ffffffff ffff881ff07c9890 > [4592064.967454] ffff8839bca53b40 ffffffff8161020c ffff881ff07c9888 ffff8839bca53b50 > [4592064.967458] Call Trace: > [4592064.967460] [] schedule+0x3c/0x90 > [4592064.967463] [] schedule_preempt_disabled+0x15/0x20 > [4592064.967466] [] __mutex_lock_slowpath+0xb0/0x130 > [4592064.967468] [] mutex_lock+0x2d/0x40 > [4592064.967471] [] tty_lock+0x38/0x90 > [4592064.967474] [] tty_release+0x55/0x590 > [4592064.967478] [] __fput+0xea/0x1f0 > [4592064.967481] [] ____fput+0xe/0x10 > [4592064.967485] [] task_work_run+0x7f/0xb0 > [4592064.967488] [] do_exit+0x373/0xb70 > [4592064.967490] [] ? __mutex_unlock_slowpath+0x32/0x40 > [4592064.967493] [] ? mutex_unlock+0x1b/0x20 > [4592064.967495] [] ? n_tty_read+0x31b/0x9e0 > [4592064.967498] [] do_group_exit+0x43/0xd0 > [4592064.967500] [] get_signal+0x2e1/0x710 > [4592064.967503] [] do_signal+0x28/0x770 > [4592064.967505] [] ? tty_ldisc_deref+0x16/0x20 > [4592064.967508] [] ? tty_read+0x9c/0xf0 > [4592064.967511] [] ? __vfs_read+0x28/0xd0 > [4592064.967514] [] ? security_file_permission+0xa3/0xc0 > [4592064.967517] [] exit_to_usermode_loop+0x8a/0xd0 > [4592064.967520] [] syscall_return_slowpath+0x80/0xf0 > [4592064.967523] [] int_ret_from_sys_call+0x25/0x9f > > After debugging the memory dump of the server it turned out the > following task is holding the mutex on which the sulogin is stuck: > > PID: 38274 TASK: ffff88395a109b80 CPU: 13 COMMAND: "sh" > #0 [ffff8837aba27ac8] __schedule at ffffffff8160fa9a > #1 [ffff8837aba27b10] schedule at ffffffff8161020c > #2 [ffff8837aba27b28] schedule_timeout at ffffffff81613472 > #3 [ffff8837aba27ba8] ldsem_down_write at ffffffff81613cc5 > #4 [ffff8837aba27c10] tty_ldisc_lock at ffffffff816142e6 > #5 [ffff8837aba27c28] tty_ldisc_hangup at ffffffff8139561a > #6 [ffff8837aba27c50] __tty_hangup at ffffffff8138cd03 > #7 [ffff8837aba27ca0] disassociate_ctty at ffffffff8138d2b9 > #8 [ffff8837aba27cd0] disassociate_ctty at ffffffff8138d5d9 > #9 [ffff8837aba27ce0] do_exit at ffffffff81055908 > #10 [ffff8837aba27d48] do_group_exit at ffffffff81055d53 > #11 [ffff8837aba27d70] get_signal at ffffffff81061bd1 > #12 [ffff8837aba27de8] do_signal at ffffffff810053c8 > #13 [ffff8837aba27ee0] exit_to_usermode_loop at ffffffff810021da > #14 [ffff8837aba27f18] syscall_return_slowpath at ffffffff81002a90 > #15 [ffff8837aba27f50] int_ret_from_sys_call at ffffffff816145f4 > RIP: 00007f18fde88aba RSP: 00007ffc47a1fa78 RFLAGS: 00000246 > RAX: fffffffffffffe00 RBX: 0000561242611d20 RCX: 00007f18fde88aba > RDX: 0000000000000000 RSI: 00007ffc47a1fa9c RDI: 00000000ffffffff > RBP: 000056124260f948 R8: 0000000000000000 R9: 00007f18fe3a5740 > R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc47a1fb20 > R13: 00007ffc47a1faa0 R14: 0000000000000000 R15: 0000000000000001 > ORIG_RAX: 000000000000003d CS: 0033 SS: 002b > > But I also observed also the following hung-up task, > waiting for read access: > > [4592064.968193] index.pl D ffff881430387ce8 0 47168 1 0x00000004 > [4592064.968196] ffff881430387ce8 ffff881ff2ed0dc0 ffff8808a0960000 ffff881430388000 > [4592064.968200] ffff881ff07c9828 ffff881ff07c9830 ffff8808a0960000 7fffffffffffffff > [4592064.968203] ffff881430387d00 ffffffff8161020c ffffffff00000004 ffff881430387d80 > [4592064.968205] Call Trace: > [4592064.968208] [] schedule+0x3c/0x90 > [4592064.968211] [] schedule_timeout+0x202/0x260 > [4592064.968216] [] ? security_capable+0x48/0x60 > [4592064.968219] [] ldsem_down_read+0x122/0x240 > [4592064.968222] [] ? vt_ioctl+0x6c/0x1460 > [4592064.968225] [] tty_ldisc_ref_wait+0x20/0x50 > [4592064.968228] [] tty_ioctl+0x417/0xd40 > [4592064.968232] [] ? handle_mm_fault+0x19b6/0x19c0 > [4592064.968236] [] do_vfs_ioctl+0x30f/0x560 > [4592064.968239] [] ? dput+0x192/0x230 > [4592064.968242] [] SyS_ioctl+0x79/0x90 > [4592064.968245] [] entry_SYSCALL_64_fastpath+0x16/0x6e > > I checked and all the tasks are working on the same tty_struct. Here > is the state of the struct: http://paste.ubuntu.com/23273973/ > > To me this seems like a lost wakeup event, especially that I've > seen patches flying around fixing such issues. Unfortunately I > cannot reliably reproduce this and booting a production machine > with lockdep would be problematic. Too bad because lockdep would definitively tell you what process is holding the ldsem for this tty. If you paste *all the tasks*, I can try to see if I recognize the task stack that is holding the ldsem. Regards, Peter Hurley