Subject: Re: [PATCH] serial: flush ldisc after hangup
To: Peter Hurley
References: <1456855375-17175-1-git-send-email-jbacik@fb.com> <56D5DCA0.2040201@hurleysoftware.com>
From: Josef Bacik
Message-ID: <56D5DDA6.5080605@fb.com>
Date: Tue, 1 Mar 2016 13:21:26 -0500
In-Reply-To: <56D5DCA0.2040201@hurleysoftware.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/01/2016 01:17 PM, Peter Hurley wrote:
> Hi Josef,
>
> On 03/01/2016 10:02 AM, Josef Bacik wrote:
>> We hit a panic pretty consistently in production that looked like this
>>
>> PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
>> #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
>> #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
>> #2 [ffff88015834ba60] oops_end at ffffffff81006478
>> #3 [ffff88015834ba90] no_context at ffffffff818c5262
>> #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
>> #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
>> #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
>> #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
>> #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
>> [exception RIP: __uart_start+0x1a]
>> RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
>> RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
>> RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
>> RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
>> R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
>> R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2
>
> Thanks for the report, but where's the rest of the stack trace?

Woops sorry about that

crash> bt
PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
 #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
 #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
 #2 [ffff88015834ba60] oops_end at ffffffff81006478
 #3 [ffff88015834ba90] no_context at ffffffff818c5262
 #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
 #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
 #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
 #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
 #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
    [exception RIP: __uart_start+0x1a]
    RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
    RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
    RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
    R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
    R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2
#10 [ffff88015834bcb0] uart_flush_chars at ffffffff8152fc1e
#11 [ffff88015834bcc0] n_tty_receive_buf_common at ffffffff81516cf1
#12 [ffff88015834bd80] n_tty_receive_buf2 at ffffffff81517414
#13 [ffff88015834bd90] flush_to_ldisc at ffffffff8151ab6d
#14 [ffff88015834bdf0] process_one_work at ffffffff81069871
#15 [ffff88015834be40] worker_thread at ffffffff81069c53
#16 [ffff88015834bec0] kthread at ffffffff8106f429
#17 [ffff88015834bf50] ret_from_fork at ffffffff818d50c8

>
>> It was a NULL pointer dereference, the state->port.tty was NULL so when we go to
>> check tty->stopped in uart_tx_stopped() we panic. Looking at the other CPU's we
>> were in the middle of uart_open(), and the core actually had a valid pointer in
>> state->port.tty, which points to a race between either close or hangup (the only
>> two places that set state->port.tty to NULL) and open. Close already flushes
>> the ldisc but hangup does not, which means we could have some characters in the
>> receive buffer in between the hangup and the open, and we end up in this
>> situation.
>
> Yeah, the race is that the ldisc should not be attempting i/o to
> the driver at all. This problem is fixed in -next already, but in the
> tty core rather than in each individual tty driver.
>

Great! Which patch/patches fix this? I looked at linux-next and there's
a lot of refactoring stuff, do I need all the things or is there a
specific one that fixes this problem? Thanks,

Josef
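
For readers following the thread, the failure mode described above (a transmit path reading tty->stopped after hangup or close has set state->port.tty to NULL) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: every name below (model_tty, model_port, model_tx_stopped, model_start_guarded, model_hangup) is hypothetical, none of it is kernel code, and it is neither the patch under discussion (which flushes the ldisc after hangup) nor the tty core fix in -next that Peter refers to. It only shows why an unguarded dereference faults and what a defensive check would look like in a single-threaded model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; NOT the real definitions. */
struct model_tty {
    bool stopped;      /* software flow control has stopped output */
    bool hw_stopped;   /* hardware flow control has stopped output */
};

struct model_port {
    struct model_tty *tty;  /* cleared to NULL by hangup/close in this model */
    int tx_started;         /* how many times transmission was kicked off */
};

/*
 * Same shape as the check named in the report: it reads tty->stopped,
 * so calling it with tty == NULL is the NULL dereference from the trace.
 */
static bool model_tx_stopped(const struct model_tty *tty)
{
    return tty->stopped || tty->hw_stopped;
}

/* Models hangup: the port no longer has a tty attached. */
static void model_hangup(struct model_port *port)
{
    port->tty = NULL;
}

/*
 * Guarded start: take one snapshot of the pointer and bail out if the
 * tty is already gone. Returns true only if transmission was started.
 * Assumption: single-threaded model; real code would also need locking
 * so the pointer cannot change between the check and the use.
 */
static bool model_start_guarded(struct model_port *port)
{
    struct model_tty *tty = port->tty;

    if (tty == NULL)
        return false;   /* late receive-buffer flush after hangup */
    if (model_tx_stopped(tty))
        return false;   /* flow control says do not transmit */

    port->tx_started++;
    return true;
}
```

In this model, a start request that arrives after model_hangup() is simply dropped instead of faulting. A per-driver check like this only hides the symptom, which is consistent with Peter's point that the ldisc should not be attempting i/o to the driver at that stage at all.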