LinuxLists.cc - tty crash in tty_ldisc_receive

2017-04-06 07:04:51

Subject: tty crash in tty_ldisc_receive_buf()

Hi all,

We are seeing the following crash (in linux-next but has been around since at
least v4.10).

[  417.514499] Unable to handle kernel paging request for data at address 0x00002260
[  417.515361] Faulting instruction address: 0xc0000000006fad80
cpu 0x15: Vector: 300 (Data Access) at [c00000799411f890]
    pc: c0000000006fad80: n_tty_receive_buf_common+0xc0/0xbd0
    lr: c0000000006fad5c: n_tty_receive_buf_common+0x9c/0xbd0
    sp: c00000799411fb10
   msr: 900000000280b033
   dar: 2260
dsisr: 40000000
current = 0xc0000079675d1e00
paca    = 0xc00000000fb0d200 softe: 0 irq_happened: 0x01
    pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 2017
enter ? for help
[c00000799411fbe0] c0000000006ff968 tty_ldisc_receive_buf+0x48/0xe0
[c00000799411fc10] c0000000007009d8 tty_port_default_receive_buf+0x68/0xe0
[c00000799411fc50] c0000000006ffce4 flush_to_ldisc+0x114/0x130
[c00000799411fca0] c00000000010a0fc process_one_work+0x1ec/0x580
[c00000799411fd30] c00000000010a528 worker_thread+0x98/0x5d0
[c00000799411fdc0] c00000000011343c kthread+0x16c/0x1b0
[c00000799411fe30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74

It seems the null ptr deref is in n_tty_receive_buf_common() where we do:

size_t tail = smp_load_acquire(&ldata->read_tail);

ldata is NULL.

We see this usually on boot but can also see it if we kill a getty attached to
tty (which is then respawned by systemd). It seems like we are flushing data to
a tty at the same time as it's being torn down and restarted.

I did try the below patch which avoids the crash but locks up one of the CPUs. I
guess the data never gets flushed if we say nothing is processed.

This is on powerpc but has also been reported by parisc.

I'm not at all familiar with the tty layer and looking at the locks, mutexes,
semaphores and reference counting in there scares the hell out of me.

If anyone has an idea, I'm happy to try a patch.

Regards,
Mikey

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..99dd757aa4 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1673,6 +1673,10 @@ n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,

down_read(&tty->termios_rwsem);

+ /* This probably shouldn't happen, but return 0 data processed */
+ if (!ldata)
+ return 0;
+
while (1) {
/*
* When PARMRK is set, each input char may take up to 3 chars

2017-04-06 07:17:08

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Thu, 2017-04-06 at 17:04 +1000, Michael Neuling wrote:
> We see this usually on boot but can also see it if we kill a getty attached to
> tty (which is then respawned by systemd). It seems like we are flushing data to
> a tty at the same time as it's being torn down and restarted.
>
> I did try the below patch which avoids the crash but locks up one of the CPUs. I
> guess the data never gets flushed if we say nothing is processed.
>
> This is on powerpc but has also been reported by parisc.
>
> I'm not at all familiar with the tty layer and looking at the locks, mutexes,
> semaphores and reference counting in there scares the hell out of me.
>
> If anyone has an idea, I'm happy to try a patch.

Note that we noticed one path that called reinit without the ldisc lock
held for writing, we added that, but it didn't fix the problem.

Cheers,
Ben.

2017-04-06 13:28:56

by Rob Herring (Arm)

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Thu, Apr 6, 2017 at 2:04 AM, Michael Neuling <[email protected]> wrote:
> Hi all,
>
> We are seeing the following crash (in linux-next but has been around since at
> least v4.10).
>
> [ 417.514499] Unable to handle kernel paging request for data at address 0x00002260
> [ 417.515361] Faulting instruction address: 0xc0000000006fad80
> cpu 0x15: Vector: 300 (Data Access) at [c00000799411f890]
> pc: c0000000006fad80: n_tty_receive_buf_common+0xc0/0xbd0
> lr: c0000000006fad5c: n_tty_receive_buf_common+0x9c/0xbd0
> sp: c00000799411fb10
> msr: 900000000280b033
> dar: 2260
> dsisr: 40000000
> current = 0xc0000079675d1e00
> paca = 0xc00000000fb0d200 softe: 0 irq_happened: 0x01
> pid = 5, comm = kworker/u56:0
> Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 2017
> enter ? for help
> [c00000799411fbe0] c0000000006ff968 tty_ldisc_receive_buf+0x48/0xe0
> [c00000799411fc10] c0000000007009d8 tty_port_default_receive_buf+0x68/0xe0
> [c00000799411fc50] c0000000006ffce4 flush_to_ldisc+0x114/0x130
> [c00000799411fca0] c00000000010a0fc process_one_work+0x1ec/0x580
> [c00000799411fd30] c00000000010a528 worker_thread+0x98/0x5d0
> [c00000799411fdc0] c00000000011343c kthread+0x16c/0x1b0
> [c00000799411fe30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
>
> It seems the null ptr deref is in n_tty_receive_buf_common() where we do:
>
> size_t tail = smp_load_acquire(&ldata->read_tail);
>
> ldata is NULL.
>
> We see this usually on boot but can also see it if we kill a getty attached to
> tty (which is then respawned by systemd). It seems like we are flushing data to
> a tty at the same time as it's being torn down and restarted.
>
> I did try the below patch which avoids the crash but locks up one of the CPUs. I
> guess the data never gets flushed if we say nothing is processed.
>
> This is on powerpc but has also been reported by parisc.
>
> I'm not at all familiar with the tty layer and looking at the locks, mutexes,
> semaphores and reference counting in there scares the hell out of me.
>
> If anyone has an idea, I'm happy to try a patch.

Can you try this one [1].

Rob

[1] https://lkml.org/lkml/2017/3/23/569

2017-04-07 00:47:18

by Michael Neuling

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

> > If anyone has an idea, I'm happy to try a patch.
>
> Can you try this one [1].

Rob, I'm still hitting it when I apply that on next-20170405. Crash below..

Any other clues?

[  229.422825] Unable to handle kernel paging request for data at address 0x00002260
[  229.423681] Faulting instruction address: 0xc0000000006fad80
cpu 0x13: Vector: 300 (Data Access) at [c00000799411f8a0]
    pc: c0000000006fad80: n_tty_receive_buf_common+0xc0/0xbd0
    lr: c0000000006fad5c: n_tty_receive_buf_common+0x9c/0xbd0
    sp: c00000799411fb20
   msr: 900000000280b033
   dar: 2260
dsisr: 40000000
current = 0xc0000079665d1e00
paca    = 0xc00000000fb0be00 softe: 0 irq_happened: 0x01
    pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405+ (mikey@bml86) (gcc version 5.4.0
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #4 SMP Thu Apr 6 19:13:58 CDT
2017
enter ? for help
[c00000799411fbf0] c0000000006ff968 tty_ldisc_receive_buf+0x48/0xe0
[c00000799411fc20] c0000000007009fc tty_port_default_receive_buf+0x2c/0x40
[c00000799411fc40] c000000000700278 flush_to_ldisc+0x168/0x190
[c00000799411fca0] c00000000010a0fc process_one_work+0x1ec/0x580
[c00000799411fd30] c00000000010a528 worker_thread+0x98/0x5d0
[c00000799411fdc0] c00000000011343c kthread+0x16c/0x1b0
[c00000799411fe30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
13:mon> r
R00 = c0000000006fad5c   R16 = 0000000000000000
R01 = c00000799411fb20   R17 = 0000000000000000
R02 = c0000000014b9200   R18 = c000007994060ea8
R03 = 0000000000000000   R19 = c000007994060c78
R04 = c0000000021c2420   R20 = c000007994060c20
R05 = c0000000021c2520   R21 = 0000000000000000
R06 = 0000000000000100   R22 = 0000000000000000
R07 = 0000000000000001   R23 = 0000000100000000
R08 = 0000000000000000   R24 = 0000000000000000
R09 = c0000000fc459600   R25 = 0000000000000000
R10 = c0000000fc459628   R26 = c0000000021c2420
R11 = 0002010100000005   R27 = c0000000021c2520
R12 = 0000000024002828   R28 = 0000000000000100
R13 = c00000000fb0be00   R29 = c0000000fc459400
R14 = c0000000001132d8   R30 = 0000000000000001
R15 = c0000079941f0ec0   R31 = c0000000fc459400
pc  = c0000000006fad80 n_tty_receive_buf_common+0xc0/0xbd0
cfar= c000000000b98e10 down_read+0x70/0x90
lr  = c0000000006fad5c n_tty_receive_buf_common+0x9c/0xbd0
msr = 900000000280b033   cr  = 24042828
ctr = c0000000006fb890   xer = 0000000000000000   trap =  300
dar = 0000000000002260   dsisr = 40000000

2017-04-07 01:03:55

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Thu, 2017-04-06 at 08:28 -0500, Rob Herring wrote:
>
> Can you try this one [1].
>
> Rob
>
> [1] https://lkml.org/lkml/2017/3/23/569

Unless I missed something, that patch doesn't change barriers or
locking, so I don't see how it could fix anything ...

Mind you we haven't quite figured out exactly what's happening here.

Cheers,
Ben.

2017-04-07 01:24:55

by Wang YanQing

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Thu, Apr 06, 2017 at 05:04:41PM +1000, Michael Neuling wrote:
> Hi all,
>
> We are seeing the following crash (in linux-next but has been around since at
> least v4.10).
>
> [??417.514499] Unable to handle kernel paging request for data at address 0x00002260
> [??417.515361] Faulting instruction address: 0xc0000000006fad80
> cpu 0x15: Vector: 300 (Data Access) at [c00000799411f890]
> ????pc: c0000000006fad80: n_tty_receive_buf_common+0xc0/0xbd0
> ????lr: c0000000006fad5c: n_tty_receive_buf_common+0x9c/0xbd0
> ????sp: c00000799411fb10
> ???msr: 900000000280b033
> ???dar: 2260
> ?dsisr: 40000000
> ? current = 0xc0000079675d1e00
> ? paca????= 0xc00000000fb0d200 ?softe: 0 ?irq_happened: 0x01
> ????pid???= 5, comm = kworker/u56:0
> Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 2017
> enter ? for help
> [c00000799411fbe0] c0000000006ff968 tty_ldisc_receive_buf+0x48/0xe0
> [c00000799411fc10] c0000000007009d8 tty_port_default_receive_buf+0x68/0xe0
> [c00000799411fc50] c0000000006ffce4 flush_to_ldisc+0x114/0x130
> [c00000799411fca0] c00000000010a0fc process_one_work+0x1ec/0x580
> [c00000799411fd30] c00000000010a528 worker_thread+0x98/0x5d0
> [c00000799411fdc0] c00000000011343c kthread+0x16c/0x1b0
> [c00000799411fe30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
>
> It seems the null ptr deref is in n_tty_receive_buf_common() where we do:
>
> size_t tail = smp_load_acquire(&ldata->read_tail);
>
> ldata is NULL.
>
> We see this usually on boot but can also see it if we kill a getty attached to
> tty (which is then respawned by systemd). It seems like we are flushing data to
> a tty at the same time as it's being torn down and restarted.
>
> I did try the below patch which avoids the crash but locks up one of the CPUs. I
> guess the data never gets flushed if we say nothing is processed.
>
> This is on powerpc but has also been reported by parisc.
>
> I'm not at all familiar with the tty layer and looking at the locks, mutexes,
> semaphores and reference counting in there scares the hell out of me.?
>
> If anyone has an idea, I'm happy to try a patch.
>
> Regards,
> Mikey
>
> diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> index bdf0e6e899..99dd757aa4 100644
> --- a/drivers/tty/n_tty.c
> +++ b/drivers/tty/n_tty.c
> @@ -1673,6 +1673,10 @@ n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
>
> down_read(&tty->termios_rwsem);
>
> + /* This probably shouldn't happen, but return 0 data processed */
> + if (!ldata)
> + return 0;
> +
> while (1) {
> /*
> * When PARMRK is set, each input char may take up to 3 chars

Maybe your patch should looks like:
+ /* This probably shouldn't happen, but return 0 data processed */
+ if (!ldata) {
+ up_read(&tty->termios_rwsem);
+ return 0;
+ }

or

Maybe below patch should work:
@@ -1668,11 +1668,12 @@ static int
n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
char *fp, int count, int flow)
{
- struct n_tty_data *ldata = tty->disc_data;
+ struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;

down_read(&tty->termios_rwsem);

+ ldata = tty->disc_data;

2017-04-07 02:06:11

by Michael Neuling

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

> > + /* This probably shouldn't happen, but return 0 data processed */
> > + if (!ldata)
> > + return 0;
> > +
> > while (1) {
> > /*
> > * When PARMRK is set, each input char may take up to 3
> > chars
>
> Maybe your patch should looks like:
> + /* This probably shouldn't happen, but return 0 data processed */
> + if (!ldata) {
> +               up_read(&tty->termios_rwsem);
> + return 0;
> +       }

Oops, nice catch.. Thanks!

That does indeed fix the problem now without the softlockup. I'm not sure it's
the right fix, but full patch below.

Anyone see a problem with this approach? Am I just papering over a real issue?

> Maybe below patch should work:
> @@ -1668,11 +1668,12 @@ static int
> n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
>                           char *fp, int count, int flow)
> {
> -       struct n_tty_data *ldata = tty->disc_data;
> +       struct n_tty_data *ldata;
>            int room, n, rcvd = 0, overflow;
>
>         down_read(&tty->termios_rwsem);
>
> +       ldata = tty->disc_data;

I did try just that alone and it didn't help.

Mikey

------------------------------------------------------------------------
>From 75c2a0369450692946ca8cc7ac148a98deaecd2a Mon Sep 17 00:00:00 2001
From: Michael Neuling <[email protected]>
Date: Fri, 7 Apr 2017 11:31:02 +1000
Subject: [PATCH] tty: fix regression in flush_to_ldisc

When reiniting a tty we can end up with:

[ 417.514499] Unable to handle kernel paging request for data at address 0x00002260
[ 417.515361] Faulting instruction address: 0xc0000000006fad80
cpu 0x15: Vector: 300 (Data Access) at [c00000799411f890]
pc: c0000000006fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c0000000006fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c00000799411fb10
msr: 900000000280b033
dar: 2260
dsisr: 40000000
current = 0xc0000079675d1e00
paca = 0xc00000000fb0d200 softe: 0 irq_happened: 0x01
pid = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 2017
enter ? for help
[c00000799411fbe0] c0000000006ff968 tty_ldisc_receive_buf+0x48/0xe0
[c00000799411fc10] c0000000007009d8 tty_port_default_receive_buf+0x68/0xe0
[c00000799411fc50] c0000000006ffce4 flush_to_ldisc+0x114/0x130
[c00000799411fca0] c00000000010a0fc process_one_work+0x1ec/0x580
[c00000799411fd30] c00000000010a528 worker_thread+0x98/0x5d0
[c00000799411fdc0] c00000000011343c kthread+0x16c/0x1b0
[c00000799411fe30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74

This is due to a NULL ptr dref of tty->disc_data.

This fixes the issue by moving the disc_data read to after we take the
semaphore, then returning 0 data processed when NULL.

Cc: <[email protected]> [4.10+]
Signed-off-by: Michael Neuling <[email protected]>
---
drivers/tty/n_tty.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..a2a9832a42 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1668,11 +1668,17 @@ static int
n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
char *fp, int count, int flow)
{
- struct n_tty_data *ldata = tty->disc_data;
+ struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;

down_read(&tty->termios_rwsem);

+ ldata = tty->disc_data;
+ if (!ldata) {
+ up_read(&tty->termios_rwsem);
+ return 0;
+ }
+
while (1) {
/*
* When PARMRK is set, each input char may take up to 3 chars
--
2.9.3

2017-04-07 14:04:03

by Rob Herring (Arm)

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Thu, Apr 6, 2017 at 8:03 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Thu, 2017-04-06 at 08:28 -0500, Rob Herring wrote:
>>
>> Can you try this one [1].
>>
>> Rob
>>
>> [1] https://lkml.org/lkml/2017/3/23/569
>
> Unless I missed something, that patch doesn't change barriers or
> locking, so I don't see how it could fix anything ...

Only because that's a code path that I changed recently and the
locking and ordering appears to be tricky.

> Mind you we haven't quite figured out exactly what's happening here.

Is this with a serial port? There were some changes to serial_core close in 4.9.

Rob

2017-04-07 23:21:54

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: tty crash in tty_ldisc_receive_buf()

On Fri, 2017-04-07 at 09:03 -0500, Rob Herring wrote:
> Is this with a serial port? There were some changes to serial_core
> close in 4.9.

It's a combo of serial and hvc actually.

Cheers,
Ben.