2016-03-01 18:03:09

by Josef Bacik

[permalink] [raw]
Subject: [PATCH] serial: flush ldisc after hangup

We hit a panic pretty consistently in production that looked like this

PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
#0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
#1 [ffff88015834b990] crash_kexec at ffffffff810cd448
#2 [ffff88015834ba60] oops_end at ffffffff81006478
#3 [ffff88015834ba90] no_context at ffffffff818c5262
#4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
#5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
#6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
#7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
#8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
[exception RIP: __uart_start+0x1a]
RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88015834bc80] uart_start at ffffffff8152fbf2

It was a NULL pointer dereference, the state->port.tty was NULL so when we go to
check tty->stopped in uart_tx_stopped() we panic. Looking at the other CPU's we
were in the middle of uart_open(), and the core actually had a valid pointer in
state->port.tty, which points to a race between either close or hangup (the only
two places that set state->port.tty to NULL) and open. Close already flushes
the ldisc but hangup does not, which means we could have some characters in the
receive buffer in between the hangup and the open, and we end up in this
situation.

Signed-off-by: Josef Bacik <[email protected]>
---
drivers/tty/serial/serial_core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index b1f54ab..93b7a53 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1527,6 +1527,7 @@ static void uart_hangup(struct tty_struct *tty)
wake_up_interruptible(&port->delta_msr_wait);
}
mutex_unlock(&port->mutex);
+ tty_ldisc_flush(tty);
}

static void uart_port_shutdown(struct tty_port *port)
--
2.5.0


2016-03-01 18:17:11

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] serial: flush ldisc after hangup

Hi Josef,

On 03/01/2016 10:02 AM, Josef Bacik wrote:
> We hit a panic pretty consistently in production that looked like this
>
> PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
> #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
> #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
> #2 [ffff88015834ba60] oops_end at ffffffff81006478
> #3 [ffff88015834ba90] no_context at ffffffff818c5262
> #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
> #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
> #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
> #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
> #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
> [exception RIP: __uart_start+0x1a]
> RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
> RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
> RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
> R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
> R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2

Thanks for the report, but where's the rest of the stack trace?

> It was a NULL pointer dereference, the state->port.tty was NULL so when we go to
> check tty->stopped in uart_tx_stopped() we panic. Looking at the other CPU's we
> were in the middle of uart_open(), and the core actually had a valid pointer in
> state->port.tty, which points to a race between either close or hangup (the only
> two places that set state->port.tty to NULL) and open. Close already flushes
> the ldisc but hangup does not, which means we could have some characters in the
> receive buffer in between the hangup and the open, and we end up in this
> situation.

Yeah, the race is that the ldisc should not be attempting i/o to
the driver at all. This problem is fixed in -next already, but in the
tty core rather than in each individual tty driver.

Regards,
Peter Hurley

> Signed-off-by: Josef Bacik <[email protected]>
> ---
> drivers/tty/serial/serial_core.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
> index b1f54ab..93b7a53 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -1527,6 +1527,7 @@ static void uart_hangup(struct tty_struct *tty)
> wake_up_interruptible(&port->delta_msr_wait);
> }
> mutex_unlock(&port->mutex);
> + tty_ldisc_flush(tty);
> }
>
> static void uart_port_shutdown(struct tty_port *port)
>

2016-03-01 18:23:18

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] serial: flush ldisc after hangup

On 03/01/2016 01:17 PM, Peter Hurley wrote:
> Hi Josef,
>
> On 03/01/2016 10:02 AM, Josef Bacik wrote:
>> We hit a panic pretty consistently in production that looked like this
>>
>> PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
>> #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
>> #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
>> #2 [ffff88015834ba60] oops_end at ffffffff81006478
>> #3 [ffff88015834ba90] no_context at ffffffff818c5262
>> #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
>> #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
>> #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
>> #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
>> #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
>> [exception RIP: __uart_start+0x1a]
>> RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
>> RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
>> RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
>> RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
>> R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
>> R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2
>
> Thanks for the report, but where's the rest of the stack trace?

Woops sorry about that

crash> bt
PID: 461061 TASK: ffff880203f8bc00 CPU: 2 COMMAND: "kworker/u8:2"
#0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
#1 [ffff88015834b990] crash_kexec at ffffffff810cd448
#2 [ffff88015834ba60] oops_end at ffffffff81006478
#3 [ffff88015834ba90] no_context at ffffffff818c5262
#4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
#5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
#6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
#7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
#8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
[exception RIP: __uart_start+0x1a]
RIP: ffffffff8152f30a RSP: ffff88015834bc80 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff822e9920 RCX: 0000000000000036
RDX: 0000000000003636 RSI: 00000000000000fe RDI: ffffffff822e9920
RBP: ffff88015834bca8 R8: 0000000000000000 R9: 00000000ffffffff
R10: ffff8802546f0d20 R11: 0000000000000000 R12: ffff880254712400
R13: 0000000000000286 R14: 00000000000000fe R15: ffff880254712400
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88015834bc80] uart_start at ffffffff8152fbf2
#10 [ffff88015834bcb0] uart_flush_chars at ffffffff8152fc1e
#11 [ffff88015834bcc0] n_tty_receive_buf_common at ffffffff81516cf1
#12 [ffff88015834bd80] n_tty_receive_buf2 at ffffffff81517414
#13 [ffff88015834bd90] flush_to_ldisc at ffffffff8151ab6d
#14 [ffff88015834bdf0] process_one_work at ffffffff81069871
#15 [ffff88015834be40] worker_thread at ffffffff81069c53
#16 [ffff88015834bec0] kthread at ffffffff8106f429
#17 [ffff88015834bf50] ret_from_fork at ffffffff818d50c8

>
>> It was a NULL pointer dereference, the state->port.tty was NULL so when we go to
>> check tty->stopped in uart_tx_stopped() we panic. Looking at the other CPU's we
>> were in the middle of uart_open(), and the core actually had a valid pointer in
>> state->port.tty, which points to a race between either close or hangup (the only
>> two places that set state->port.tty to NULL) and open. Close already flushes
>> the ldisc but hangup does not, which means we could have some characters in the
>> receive buffer in between the hangup and the open, and we end up in this
>> situation.
>
> Yeah, the race is that the ldisc should not be attempting i/o to
> the driver at all. This problem is fixed in -next already, but in the
> tty core rather than in each individual tty driver.
>

Great! Which patch/patches fix this? I looked at linux-next and
there's a lot of refactoring stuff, do I need all the things or is there
a specific one that fixes this problem? Thanks,

Josef

2016-03-01 19:01:48

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] serial: flush ldisc after hangup

On 03/01/2016 10:21 AM, Josef Bacik wrote:
> On 03/01/2016 01:17 PM, Peter Hurley wrote:

>> Yeah, the race is that the ldisc should not be attempting i/o to
>> the driver at all. This problem is fixed in -next already, but in
>> the tty core rather than in each individual tty driver.
>>
>
> Great! Which patch/patches fix this? I looked at linux-next and
> there's a lot of refactoring stuff, do I need all the things or is
> there a specific one that fixes this problem? Thanks,

The series that fixes this design problem introduce in 3.10 are these
patches:

staging: digi: Replace open-coded tty_wakeup()
serial: 68328: Remove bogus ldisc reset
bluetooth: hci_ldisc: Remove dead code
NFC: nci: Remove dead code
tty: Remove chars_in_buffer() line discipline method
tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
n_tty: Fix unsafe reference to "other" ldisc
tty: Reset c_line from driver's init_termios
staging/speakup: Use tty_ldisc_ref() for paste kworker
tty: Fix comments for tty_ldisc_get()
tty: Fix comments for tty_ldisc_release()
tty: Prepare for destroying line discipline on hangup
tty: Handle NULL tty->ldisc
tty: Move tty_ldisc_kill()
tty: Use 'disc' for line discipline index name
tty: Refactor tty_ldisc_reinit() for reuse
tty: Destroy ldisc instance on hangup

'tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)' and
'tty: Fix unsafe reference to "other" ldisc' have already
made their way to most stable trees, afaict.

I doubt I'll submit any more of this series for stable;
it's just too invasive. I'll probably do it per-driver, like
your patch. Just an fyi.

Regards,
Peter Hurley

2016-03-01 19:04:24

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] serial: flush ldisc after hangup

On 03/01/2016 02:01 PM, Peter Hurley wrote:
> On 03/01/2016 10:21 AM, Josef Bacik wrote:
>> On 03/01/2016 01:17 PM, Peter Hurley wrote:
>
>>> Yeah, the race is that the ldisc should not be attempting i/o to
>>> the driver at all. This problem is fixed in -next already, but in
>>> the tty core rather than in each individual tty driver.
>>>
>>
>> Great! Which patch/patches fix this? I looked at linux-next and
>> there's a lot of refactoring stuff, do I need all the things or is
>> there a specific one that fixes this problem? Thanks,
>
> The series that fixes this design problem introduce in 3.10 are these
> patches:
>
> staging: digi: Replace open-coded tty_wakeup()
> serial: 68328: Remove bogus ldisc reset
> bluetooth: hci_ldisc: Remove dead code
> NFC: nci: Remove dead code
> tty: Remove chars_in_buffer() line discipline method
> tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
> n_tty: Fix unsafe reference to "other" ldisc
> tty: Reset c_line from driver's init_termios
> staging/speakup: Use tty_ldisc_ref() for paste kworker
> tty: Fix comments for tty_ldisc_get()
> tty: Fix comments for tty_ldisc_release()
> tty: Prepare for destroying line discipline on hangup
> tty: Handle NULL tty->ldisc
> tty: Move tty_ldisc_kill()
> tty: Use 'disc' for line discipline index name
> tty: Refactor tty_ldisc_reinit() for reuse
> tty: Destroy ldisc instance on hangup
>
> 'tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)' and
> 'tty: Fix unsafe reference to "other" ldisc' have already
> made their way to most stable trees, afaict.
>
> I doubt I'll submit any more of this series for stable;
> it's just too invasive. I'll probably do it per-driver, like
> your patch. Just an fyi.
>

That's fine, we aren't afraid to pull back lots of stuff to our kernels,
as long as they are destined for upstream. Thanks for the help!

Josef