2013-07-01 14:49:20

by Andre Naujoks

[permalink] [raw]
Subject: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

Hello.

This patch removes the direct call to tty_wakeup in pty_write. I have
not noticed any drawbacks with this but I am not familiar with the pty
driver at all. I think what happens is a recursive loop,
write_wakeup->write->write_wakeup ...

The documentation for the tty interface forbids this direct call:

(from Documentation/serial/tty.txt)
write_wakeup() - May be called at any point between open and close.
The TTY_DO_WRITE_WAKEUP flag indicates if a call
is needed but always races versus calls. Thus the
ldisc must be careful about setting order and to
handle unexpected calls. Must not sleep.

The driver is forbidden from calling this directly
from the ->write call from the ldisc as the ldisc
is permitted to call the driver write method from
this function. In such a situation defer it.



The direct call caused a reproducable kernel panic (see bottom of this
mail) for me with the following setup:

- using can-utils from git://gitorious.org/linux-can/can-utils.git
slcan_attach and cangen are used

- create a network link between two serial CAN interfaces with:
$ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
$ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
$ slcan_attach /tmp/slcan0
$ slcan_attach /tmp/slcan1
$ ip link set slcan0 up
$ ip link set slcan1 up

- produce a kernel panic by overloading the CAN interfaces:
$ cangen slcan0 -g0


Please keep me in CC. I am not subscribed to the list.
If I can provide any more information, I will be glad to do so.

This is the patch. It applies to the current linux master branch:


>From 9f67139bebb938026406a66c1411e0b50628a238 Mon Sep 17 00:00:00 2001
From: Andre Naujoks <[email protected]>
Date: Mon, 1 Jul 2013 15:45:13 +0200
Subject: [PATCH 1/2] remove direct call to tty_wakeup in pty_write.

Signed-off-by: Andre Naujoks <[email protected]>
---
drivers/tty/pty.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index abfd990..5dcb782 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -127,7 +127,6 @@ static int pty_write(struct tty_struct *tty, const
unsigned char *buf, int c)
/* And shovel */
if (c) {
tty_flip_buffer_push(to->port);
- tty_wakeup(tty);
}
}
return c;
--
1.8.3.1

Regards
Andre Naujoks

Kernel-Panic:

[ 61.764168] ------------[ cut here ]------------
[ 61.765107] WARNING: at
/build/linux-9VFSO6/linux-3.9.4/kernel/softirq.c:160
_local_bh_enable_ip.isra.16+0x33/0x88()
[ 61.766467] Hardware name: Bochs
[ 61.766900] Modules linked in: can_raw
[ 61.768420] ------------[ cut here ]------------
[ 61.771474] kernel BUG at
/build/linux-9VFSO6/linux-3.9.4/kernel/sched/core.c:524!
[ 61.772378] invalid opcode: 0000 [#1] SMP
[ 61.772378] Modules linked in: can_raw can slcan vcan nfsv4 nfsd
auth_rpcgss nfs_acl nfs lockd dns_resolver fscache sunrpc loop snd_pcm
snd_page_alloc kvm_amd snd_timer kvm snd ttm soundcore drm_kms_helper
parport_pc parport drm i2c_piix4 psmouse i2c_core processor pcspkr
serio_raw thermal_sys evdev button ext4 crc16 jbd2 mbcache sg sr_mod
sd_mod crc_t10dif cdrom ata_generic virtio_net floppy ata_piix
virtio_pci virtio_ring virtio libata scsi_mod
[ 61.772378] CPU 0
[ 61.772378] Pid: 2547, comm: socat Not tainted 3.9-1-amd64 #1 Debian
3.9.4-1 Bochs Bochs
[ 61.772378] RIP: 0010:[<ffffffff8106212f>] [<ffffffff8106212f>]
resched_task+0x26/0x5d
[ 61.772378] RSP: 0018:ffff88007fc03e38 EFLAGS: 00010046
[ 61.772378] RAX: 0000000000000000 RBX: ffff88003739f7f0 RCX:
0000000000416036
[ 61.772378] RDX: 0000000000000000 RSI: 0000000000000c00 RDI:
ffff88007a9ba000
[ 61.772378] RBP: ffff88007fc13f30 R08: 0000000000000004 R09:
0000000000000001
[ 61.772378] R10: 00000000000016af R11: 000000000000b768 R12:
00000000001e8480
[ 61.772378] R13: ffff88007fc13ec0 R14: 0000000000000000 R15:
ffff88007fc03f50
[ 61.772378] FS: 00007f2c71d11700(0000) GS:ffff88007fc00000(0000)
knlGS:0000000000000000
[ 61.772378] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 61.772378] CR2: 0000000001882808 CR3: 00000000370fe000 CR4:
00000000000006f0
[ 61.772378] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 61.772378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 61.772378] Process socat (pid: 2547, threadinfo ffff88007a9ba000,
task ffff88003739f7f0)
[ 61.772378] Stack:
[ 61.772378] ffff88003739f838 ffffffff810685f8 ffff88007fc13ec0
0000000000000000
[ 61.772378] ffff88003739f7f0 ffff88007fc0e2b0 ffffffff8107b14d
ffffffff81063a65
[ 61.772378] ffff88003739f7f0 0000000000000000 0000000000000000
ffffffff8104a51d
[ 61.772378] Call Trace:
[ 61.772378] <IRQ>
[ 61.772378] [<ffffffff810685f8>] ? task_tick_fair+0x91/0xf5
[ 61.772378] [<ffffffff8107b14d>] ? tick_sched_do_timer+0x25/0x25
[ 61.772378] [<ffffffff81063a65>] ? scheduler_tick+0xb5/0xdd
[ 61.772378] [<ffffffff8104a51d>] ? update_process_times+0x50/0x5c
[ 61.772378] [<ffffffff8107aea3>] ? tick_sched_handle+0x3f/0x4c
[ 61.772378] [<ffffffff8107b17d>] ? tick_sched_timer+0x30/0x4c
[ 61.772378] [<ffffffff8105a481>] ? __run_hrtimer+0xae/0x154
[ 61.772378] [<ffffffff8105ad13>] ? hrtimer_interrupt+0xc5/0x1a7
[ 61.772378] [<ffffffff81028c8f>] ? smp_apic_timer_interrupt+0x6e/0x81
[ 61.772378] [<ffffffff813961dd>] ? apic_timer_interrupt+0x6d/0x80
[ 61.772378] <EOI>
[ 61.772378] [<ffffffff8103d61a>] ? arch_local_irq_restore+0x2/0x8
[ 61.772378] [<ffffffff8103f5a9>] ? vprintk_emit+0x3be/0x3e4
[ 61.772378] [<ffffffff8103ed4a>] ? wake_up_klogd+0x2d/0x31
[ 61.772378] [<ffffffff81043bd8>] ? _local_bh_enable_ip.isra.16+0x33/0x88
[ 61.772378] [<ffffffff8138a939>] ? printk+0x4f/0x54
[ 61.772378] [<ffffffff810850b5>] ? print_modules+0x51/0xb8
[ 61.772378] [<ffffffff8103d537>] ? warn_slowpath_common+0x71/0x8c
[ 61.772378] [<ffffffff81043bd8>] ? _local_bh_enable_ip.isra.16+0x33/0x88
[ 61.772378] [<ffffffff813028b6>] ? tcp_sendmsg+0x1f/0x7ca
[ 61.772378] [<ffffffff8105eaea>] ? __wake_up+0x35/0x46
[ 61.772378] [<ffffffff812bc3ad>] ? sock_aio_write+0xc8/0xed
[ 61.772378] [<ffffffff8105eaea>] ? __wake_up+0x35/0x46
[ 61.772378] [<ffffffff8110cf93>] ? do_sync_write+0x62/0x9b
[ 61.772378] [<ffffffff8110d541>] ? vfs_write+0x9d/0xf8
[ 61.772378] [<ffffffff81061600>] ? should_resched+0x5/0x23
[ 61.772378] [<ffffffff8110d828>] ? sys_write+0x51/0x80
[ 61.772378] [<ffffffff813955e9>] ? system_call_fastpath+0x16/0x1b
[ 61.772378] Code: 00 5b 5b 5d c3 53 48 89 fb 48 8b 7f 08 48 c7 c0 c0
3e 01 00 8b 57 18 48 03 04 d5 80 eb 68 81 8b 00 89 c2 c1 ea 10 66 39 c2
75 02 <0f> 0b 48 8b 47 10 a8 08 75 2b e8 d7 ec ff ff 48 8b 43 08 8b 78
[ 61.772378] RIP [<ffffffff8106212f>] resched_task+0x26/0x5d
[ 61.772378] RSP <ffff88007fc03e38>
[ 61.772378] ---[ end trace e7680e6512133308 ]---
[ 61.772378] Kernel panic - not syncing: Fatal exception in interrupt





2013-07-02 09:39:19

by Dean Jenkins

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

On 01/07/13 15:49, Andre Naujoks wrote:
> Hello.
>
> This patch removes the direct call to tty_wakeup in pty_write. I have
> not noticed any drawbacks with this but I am not familiar with the pty
> driver at all. I think what happens is a recursive loop,
> write_wakeup->write->write_wakeup ...
Indeed there is a recursive loop that I have witnessed whilst using SLIP
over USB HID.

In the failure case for SLIP, xleft remains positive and recursion
causes a stack overflow as xleft is not updated during the recursion:
sl_encaps()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()
etc.

The underlying issue being pty_write() calling tty_wakeup() within the
initial write thread. Note that the TTY wakeup event uses tty_wakeup()
as a callback to request more data.
> The documentation for the tty interface forbids this direct call:
>
> (from Documentation/serial/tty.txt)
> write_wakeup() - May be called at any point between open and close.
> The TTY_DO_WRITE_WAKEUP flag indicates if a call
> is needed but always races versus calls. Thus the
> ldisc must be careful about setting order and to
> handle unexpected calls. Must not sleep.
>
> The driver is forbidden from calling this directly
> from the ->write call from the ldisc as the ldisc
> is permitted to call the driver write method from
> this function. In such a situation defer it.
I also saw that documentation and the code seems to be breaking that
description.

It is possible to mitigate against
pty-write-->tty_wakeup()-->slip_write_wakeup() by clearing and setting
TTY_DO_WRITE_WAKEUP at the "correct" time in the layers bound to
PTY/TTY. Indeed, I modified SLIP to do that and sent patches to the
linux-netdev mailing list although I am not aware of any progress in
having those patches accepted.

Perhaps this mitigation could be applied to your scenario ?

Eventually, your patch could be used when all protocols have mitigration
in place.

>
>
> The direct call caused a reproducable kernel panic (see bottom of this
> mail) for me with the following setup:
>
> - using can-utils from git://gitorious.org/linux-can/can-utils.git
> slcan_attach and cangen are used
>
> - create a network link between two serial CAN interfaces with:
> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
> $ slcan_attach /tmp/slcan0
> $ slcan_attach /tmp/slcan1
> $ ip link set slcan0 up
> $ ip link set slcan1 up
>
> - produce a kernel panic by overloading the CAN interfaces:
> $ cangen slcan0 -g0
>
>
> Please keep me in CC. I am not subscribed to the list.
> If I can provide any more information, I will be glad to do so.
>
> This is the patch. It applies to the current linux master branch:
>
>
> From 9f67139bebb938026406a66c1411e0b50628a238 Mon Sep 17 00:00:00 2001
> From: Andre Naujoks <[email protected]>
> Date: Mon, 1 Jul 2013 15:45:13 +0200
> Subject: [PATCH 1/2] remove direct call to tty_wakeup in pty_write.
>
> Signed-off-by: Andre Naujoks <[email protected]>
> ---
> drivers/tty/pty.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> index abfd990..5dcb782 100644
> --- a/drivers/tty/pty.c
> +++ b/drivers/tty/pty.c
> @@ -127,7 +127,6 @@ static int pty_write(struct tty_struct *tty, const
> unsigned char *buf, int c)
> /* And shovel */
> if (c) {
> tty_flip_buffer_push(to->port);
> - tty_wakeup(tty);
> }
> }
> return c;
I agree that this looks to be a simple remedy to the issue. However,
this code has existed in this state for many years so there is a risk
that some applications rely on this "feature" in order to work. For
example, SLIP uses this "feature" although I am not certain that the
existing SLIP code would break if your patch were applied. The TTY
wakeup event should take care of completing the transmission (in theory).

I think a lower risk approach would be to modify the upper layers in the
protocol write function to clear the TTY_DO_WRITE_WAKEUP flag before
pty_write is called and set TTY_DO_WRITE_WAKEUP afterwards to allow the
TTY wakeup event to complete the transmission. I did this in the SLIP
sl_encaps() and slip_write_wakeup() functions. Note that the SLIP
segmentation mechanism is also broken as it is possible to send
truncated SLIP frames upon a failure to send all characters of the
frame. This leads to the recursive loop on the transmission of next SLIP
frame because xleft remains positive (uninitialised) and this causes the
kernel to crash catastrophically. The scheduler crashes due to the stack
overflow overwriting the task data structures.

If anyone is interested, I could upload my SLIP patches to the
linux-kernel mailing list.

Regards,
Dean Jenkins
Mentor Graphics

2013-07-02 11:02:20

by Andre Naujoks

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

On 02.07.2013 11:39, schrieb Dean Jenkins:
> On 01/07/13 15:49, Andre Naujoks wrote:
>> Hello.
>>
>> This patch removes the direct call to tty_wakeup in pty_write. I have
>> not noticed any drawbacks with this but I am not familiar with the pty
>> driver at all. I think what happens is a recursive loop,
>> write_wakeup->write->write_wakeup ...
> Indeed there is a recursive loop that I have witnessed whilst using SLIP
> over USB HID.
>
> In the failure case for SLIP, xleft remains positive and recursion
> causes a stack overflow as xleft is not updated during the recursion:
> sl_encaps()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()
> etc.

The funny thing about the SLIP driver is, that I could not reproduce
this particular issue with it. The SLIP driver would just hang after a
second of high load but never crash the kernel for me. It may be,
because I used a network connection with socat, which is fast enough for
the data to go through after a few recursions.

>
> The underlying issue being pty_write() calling tty_wakeup() within the
> initial write thread. Note that the TTY wakeup event uses tty_wakeup()
> as a callback to request more data.
>> The documentation for the tty interface forbids this direct call:
>>
>> (from Documentation/serial/tty.txt)
>> write_wakeup() - May be called at any point between open and close.
>> The TTY_DO_WRITE_WAKEUP flag indicates if a call
>> is needed but always races versus calls. Thus the
>> ldisc must be careful about setting order and to
>> handle unexpected calls. Must not sleep.
>>
>> The driver is forbidden from calling this directly
>> from the ->write call from the ldisc as the ldisc
>> is permitted to call the driver write method from
>> this function. In such a situation defer it.
> I also saw that documentation and the code seems to be breaking that
> description.
>
> It is possible to mitigate against
> pty-write-->tty_wakeup()-->slip_write_wakeup() by clearing and setting
> TTY_DO_WRITE_WAKEUP at the "correct" time in the layers bound to
> PTY/TTY. Indeed, I modified SLIP to do that and sent patches to the
> linux-netdev mailing list although I am not aware of any progress in
> having those patches accepted.

See below, why I don't think, that that is a good idea.

>
> Perhaps this mitigation could be applied to your scenario ?
>
> Eventually, your patch could be used when all protocols have mitigration
> in place.
>
>>
>>
>> The direct call caused a reproducable kernel panic (see bottom of this
>> mail) for me with the following setup:
>>
>> - using can-utils from git://gitorious.org/linux-can/can-utils.git
>> slcan_attach and cangen are used
>>
>> - create a network link between two serial CAN interfaces with:
>> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
>> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
>> $ slcan_attach /tmp/slcan0
>> $ slcan_attach /tmp/slcan1
>> $ ip link set slcan0 up
>> $ ip link set slcan1 up
>>
>> - produce a kernel panic by overloading the CAN interfaces:
>> $ cangen slcan0 -g0
>>
>>
>> Please keep me in CC. I am not subscribed to the list.
>> If I can provide any more information, I will be glad to do so.
>>
>> This is the patch. It applies to the current linux master branch:
>>
>>
>> From 9f67139bebb938026406a66c1411e0b50628a238 Mon Sep 17 00:00:00 2001
>> From: Andre Naujoks <[email protected]>
>> Date: Mon, 1 Jul 2013 15:45:13 +0200
>> Subject: [PATCH 1/2] remove direct call to tty_wakeup in pty_write.
>>
>> Signed-off-by: Andre Naujoks <[email protected]>
>> ---
>> drivers/tty/pty.c | 1 -
>> 1 file changed, 1 deletion(-)
>>
>> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
>> index abfd990..5dcb782 100644
>> --- a/drivers/tty/pty.c
>> +++ b/drivers/tty/pty.c
>> @@ -127,7 +127,6 @@ static int pty_write(struct tty_struct *tty, const
>> unsigned char *buf, int c)
>> /* And shovel */
>> if (c) {
>> tty_flip_buffer_push(to->port);
>> - tty_wakeup(tty);
>> }
>> }
>> return c;
> I agree that this looks to be a simple remedy to the issue. However,
> this code has existed in this state for many years so there is a risk
> that some applications rely on this "feature" in order to work. For
> example, SLIP uses this "feature" although I am not certain that the
> existing SLIP code would break if your patch were applied. The TTY
> wakeup event should take care of completing the transmission (in theory).

The SLIP driver doesn't break. It is actually more stable and does not
hang for me any more with high loads.

For example this hangs for me after a second or two over a
slip-pty-socat connection without the fix:

machineA$ netcat -l -p 60000
machineB$ cat /dev/zero | netcat <machineA-ip> 60000

With the fix, it does not hang any more.


I am currently using this patch and it "just works". I agree, that this
should probably be tested by some more people which are likely to be
affected by this.

I'd really like a comment from Greg Kroah-Hartman or Jiri Slaby on this,
because they are the maintainers for the pty driver and probably know
about any implications of this.

>
> I think a lower risk approach would be to modify the upper layers in the
> protocol write function to clear the TTY_DO_WRITE_WAKEUP flag before
> pty_write is called and set TTY_DO_WRITE_WAKEUP afterwards to allow the
> TTY wakeup event to complete the transmission. I did this in the SLIP

Does this cause a race? What if the callback is defered and we have not
re-set the flag yet, when it should be called?

The wakup function is called even with the direct call removed,
otherwise the slcan and the slip drivers would no longer work, it is
just defered as required by the documentation.

I think there is no way around fixing the pty driver. All workarounds I
could think of (including disabling the callback for the duration of the
write inside the callback) are somehow broken.

> sl_encaps() and slip_write_wakeup() functions. Note that the SLIP
> segmentation mechanism is also broken as it is possible to send
> truncated SLIP frames upon a failure to send all characters of the

Is that really a problem? The SLIP driver (and the slcan driver) stops
the netif queue before entering sl_encaps and wakes it up again, when
all data is transmitted (from the wakup call), doesn't it? So there
should be not further calls to sl_xmit while the queue is stopped. But I
might misunderstand the semantics of netif_stop_queue and
netif_wake_queue here.

Regards
Andre

> frame. This leads to the recursive loop on the transmission of next SLIP
> frame because xleft remains positive (uninitialised) and this causes the
> kernel to crash catastrophically. The scheduler crashes due to the stack
> overflow overwriting the task data structures.
>
> If anyone is interested, I could upload my SLIP patches to the
> linux-kernel mailing list.
>
> Regards,
> Dean Jenkins
> Mentor Graphics

2013-07-02 15:31:16

by Dean Jenkins

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

On 02/07/13 12:02, Andre Naujoks wrote:
> On 02.07.2013 11:39, schrieb Dean Jenkins:
>> On 01/07/13 15:49, Andre Naujoks wrote:
>>> Hello.
>>>
>>> This patch removes the direct call to tty_wakeup in pty_write. I have
>>> not noticed any drawbacks with this but I am not familiar with the pty
>>> driver at all. I think what happens is a recursive loop,
>>> write_wakeup->write->write_wakeup ...
>> Indeed there is a recursive loop that I have witnessed whilst using SLIP
>> over USB HID.
>>
>> In the failure case for SLIP, xleft remains positive and recursion
>> causes a stack overflow as xleft is not updated during the recursion:
>> sl_encaps()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()-->pty_write()-->tty_wakeup()-->slip_write_wakeup()
>> etc.
> The funny thing about the SLIP driver is, that I could not reproduce
> this particular issue with it. The SLIP driver would just hang after a
> second of high load but never crash the kernel for me. It may be,
> because I used a network connection with socat, which is fast enough for
> the data to go through after a few recursions.
In my case, failure occurred because the xleft variable remained
positive at the end of the call to sl_encaps(). Normally, xleft is zero
upon leaving sl_encaps() because only a single write per SLIP frame to
PTY/TTY is necessary to write the whole frame to the TTY layer. In other
words, xleft is NOT initialised to zero on the transmission of the next
SLIP frame; xleft is normally zero because the previous SLIP was sent
OK. Therefore, the kernel crashes on the transmission of the next SLIP
frame after a failure to send a complete SLIP frame.

You can trigger the recursion crash by hard-coding xleft to a positive
value before exiting sl_encaps(). This simulates a failure to send the
complete SLIP frame perhaps due a kmalloc failure in the TTY buffer
allocation and therefore is a rare event.

Note some TTY layers such as Bluetooth's RFCOMM TTY consumes all the
SLIP frame characters in a single write from sl_encaps() or RFCOMM TTY
returns an error code (not handled by SLIP so is another defect in
SLIP). In other words, some TTY layers do not use the segmentation
mechanism of multiple calls to pty_write() via the TTY wakeup mechanism.
>> The underlying issue being pty_write() calling tty_wakeup() within the
>> initial write thread. Note that the TTY wakeup event uses tty_wakeup()
>> as a callback to request more data.
>>> The documentation for the tty interface forbids this direct call:
>>>
>>> (from Documentation/serial/tty.txt)
>>> write_wakeup() - May be called at any point between open and close.
>>> The TTY_DO_WRITE_WAKEUP flag indicates if a call
>>> is needed but always races versus calls. Thus the
>>> ldisc must be careful about setting order and to
>>> handle unexpected calls. Must not sleep.
>>>
>>> The driver is forbidden from calling this directly
>>> from the ->write call from the ldisc as the ldisc
>>> is permitted to call the driver write method from
>>> this function. In such a situation defer it.
>> I also saw that documentation and the code seems to be breaking that
>> description.
>>
>> It is possible to mitigate against
>> pty-write-->tty_wakeup()-->slip_write_wakeup() by clearing and setting
>> TTY_DO_WRITE_WAKEUP at the "correct" time in the layers bound to
>> PTY/TTY. Indeed, I modified SLIP to do that and sent patches to the
>> linux-netdev mailing list although I am not aware of any progress in
>> having those patches accepted.
> See below, why I don't think, that that is a good idea.
>
>> Perhaps this mitigation could be applied to your scenario ?
>>
>> Eventually, your patch could be used when all protocols have mitigration
>> in place.
>>
>>>
>>> The direct call caused a reproducable kernel panic (see bottom of this
>>> mail) for me with the following setup:
>>>
>>> - using can-utils from git://gitorious.org/linux-can/can-utils.git
>>> slcan_attach and cangen are used
>>>
>>> - create a network link between two serial CAN interfaces with:
>>> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
>>> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
>>> $ slcan_attach /tmp/slcan0
>>> $ slcan_attach /tmp/slcan1
>>> $ ip link set slcan0 up
>>> $ ip link set slcan1 up
>>>
>>> - produce a kernel panic by overloading the CAN interfaces:
>>> $ cangen slcan0 -g0
>>>
>>>
>>> Please keep me in CC. I am not subscribed to the list.
>>> If I can provide any more information, I will be glad to do so.
>>>
>>> This is the patch. It applies to the current linux master branch:
>>>
>>>
>>> From 9f67139bebb938026406a66c1411e0b50628a238 Mon Sep 17 00:00:00 2001
>>> From: Andre Naujoks <[email protected]>
>>> Date: Mon, 1 Jul 2013 15:45:13 +0200
>>> Subject: [PATCH 1/2] remove direct call to tty_wakeup in pty_write.
>>>
>>> Signed-off-by: Andre Naujoks <[email protected]>
>>> ---
>>> drivers/tty/pty.c | 1 -
>>> 1 file changed, 1 deletion(-)
>>>
>>> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
>>> index abfd990..5dcb782 100644
>>> --- a/drivers/tty/pty.c
>>> +++ b/drivers/tty/pty.c
>>> @@ -127,7 +127,6 @@ static int pty_write(struct tty_struct *tty, const
>>> unsigned char *buf, int c)
>>> /* And shovel */
>>> if (c) {
>>> tty_flip_buffer_push(to->port);
>>> - tty_wakeup(tty);
>>> }
>>> }
>>> return c;
>> I agree that this looks to be a simple remedy to the issue. However,
>> this code has existed in this state for many years so there is a risk
>> that some applications rely on this "feature" in order to work. For
>> example, SLIP uses this "feature" although I am not certain that the
>> existing SLIP code would break if your patch were applied. The TTY
>> wakeup event should take care of completing the transmission (in theory).
> The SLIP driver doesn't break. It is actually more stable and does not
> hang for me any more with high loads.
>
> For example this hangs for me after a second or two over a
> slip-pty-socat connection without the fix:
>
> machineA$ netcat -l -p 60000
> machineB$ cat /dev/zero | netcat <machineA-ip> 60000
>
> With the fix, it does not hang any more.
>
>
> I am currently using this patch and it "just works". I agree, that this
> should probably be tested by some more people which are likely to be
> affected by this.
My patches work in a similar way be manipulating TTY_DO_WRITE_WAKEUP to
prevent pty_write() calling the wakeup function pointer within sl_encaps().
>
> I'd really like a comment from Greg Kroah-Hartman or Jiri Slaby on this,
> because they are the maintainers for the pty driver and probably know
> about any implications of this.
>
>> I think a lower risk approach would be to modify the upper layers in the
>> protocol write function to clear the TTY_DO_WRITE_WAKEUP flag before
>> pty_write is called and set TTY_DO_WRITE_WAKEUP afterwards to allow the
>> TTY wakeup event to complete the transmission. I did this in the SLIP
> Does this cause a race? What if the callback is defered and we have not
> re-set the flag yet, when it should be called?
I do not think there is a race with my solution. I will forward my
patches to the linux-kernel mailing list so you can see.

The race is worse at the moment because both sl_encaps() and the TTY
wakeup event can call the tty_wakeup() function and therefore
potentially calls the function pointer to execute slip_write_wakeup() at
the same time on 2 CPUs. In the current code there is a risk that
slip_write_wakeup() runs concurrently with itself.
>
> The wakup function is called even with the direct call removed,
> otherwise the slcan and the slip drivers would no longer work, it is
> just defered as required by the documentation.
>
> I think there is no way around fixing the pty driver. All workarounds I
> could think of (including disabling the callback for the duration of the
> write inside the callback) are somehow broken.
>
>> sl_encaps() and slip_write_wakeup() functions. Note that the SLIP
>> segmentation mechanism is also broken as it is possible to send
>> truncated SLIP frames upon a failure to send all characters of the
> Is that really a problem? The SLIP driver (and the slcan driver) stops
> the netif queue before entering sl_encaps and wakes it up again, when
> all data is transmitted (from the wakup call), doesn't it? So there
> should be not further calls to sl_xmit while the queue is stopped. But I
> might misunderstand the semantics of netif_stop_queue and
> netif_wake_queue here.
The problem is that xleft remembers the state from the previous SLIP
frame rather than set to zero for the new SLIP frame. Yes, the netif
queue mechanism provides locking but the lock will be prematurely
cleared in slip_write_wakeup() when the SLIP frame fails to be fully
sent because slip_write_wakeup() first uses the xleft value from the
previous SLIP frame eg. zero.

>
> Regards
> Andre
>
>> frame. This leads to the recursive loop on the transmission of next SLIP
>> frame because xleft remains positive (uninitialised) and this causes the
>> kernel to crash catastrophically. The scheduler crashes due to the stack
>> overflow overwriting the task data structures.
>>
>> If anyone is interested, I could upload my SLIP patches to the
>> linux-kernel mailing list.
Regards,
Dean Jenkins
Mentor Graphics

2013-07-02 18:59:22

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

On 07/01/2013 10:49 AM, Andre Naujoks wrote:
> Hello.
>
> This patch removes the direct call to tty_wakeup in pty_write. I have
> not noticed any drawbacks with this but I am not familiar with the pty
> driver at all. I think what happens is a recursive loop,
> write_wakeup->write->write_wakeup ...
>
> The documentation for the tty interface forbids this direct call:
>
> (from Documentation/serial/tty.txt)
> write_wakeup() - May be called at any point between open and close.
> The TTY_DO_WRITE_WAKEUP flag indicates if a call
> is needed but always races versus calls. Thus the
> ldisc must be careful about setting order and to
> handle unexpected calls. Must not sleep.
>
> The driver is forbidden from calling this directly
> from the ->write call from the ldisc as the ldisc
> is permitted to call the driver write method from
> this function. In such a situation defer it.
>
>
>
> The direct call caused a reproducable kernel panic (see bottom of this
> mail) for me with the following setup:
>
> - using can-utils from git://gitorious.org/linux-can/can-utils.git
> slcan_attach and cangen are used
>
> - create a network link between two serial CAN interfaces with:
> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
> $ slcan_attach /tmp/slcan0
> $ slcan_attach /tmp/slcan1
> $ ip link set slcan0 up
> $ ip link set slcan1 up
>
> - produce a kernel panic by overloading the CAN interfaces:
> $ cangen slcan0 -g0
>
>
> Please keep me in CC. I am not subscribed to the list.
> If I can provide any more information, I will be glad to do so.
>
> This is the patch. It applies to the current linux master branch:

An identical patch is in Greg's queue for linux-next:
'tty: Remove extra wakeup from pty write() path'

That patch's commit message details why tty_wakeup() is unnecessary,
but does not foresee or document the SLIP ldisc write()/write_wakeup()
recursion.

Since this fix will now likely go back through stable, the commit
message should include a description of the recursion, so that Greg can
merge the commit messages.

Separately, the stack trace for the WARN and the oops implicates
the network stack alone. Maybe there is some other problem?

Regards,
Peter Hurley


2013-07-02 20:21:58

by Andre Naujoks

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write with better commit message

Thanks for the pointer. Since your patch is from April, does that mean,
that we cannot expect it to hit 3.11?


From 6cbfeb6cb9bbe7455cddbf162a7b158e1debd578 Mon Sep 17 00:00:00 2001
From: Andre Naujoks <[email protected]>
Date: Tue, 2 Jul 2013 22:11:33 +0200
Subject: [PATCH] Remove the tty_wakeup call inside pty_write

The call to tty_wakeup can cause a recursive loop and therefore a
kernel oops when pty_write is called again from the ldisc wakeup
function but there is not enough room for all data at once.

Signed-off-by: Andre Naujoks <[email protected]>
---
drivers/tty/pty.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index abfd990..cae4e4f 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -125,10 +125,8 @@ static int pty_write(struct tty_struct *tty, const
unsigned char *buf, int c)
/* Stuff the data into the input queue of the other end */
c = tty_insert_flip_string(to->port, buf, c);
/* And shovel */
- if (c) {
+ if (c)
tty_flip_buffer_push(to->port);
- tty_wakeup(tty);
- }
}
return c;
}
--
1.8.3.1



On 02.07.2013 20:59, Peter Hurley wrote:
> On 07/01/2013 10:49 AM, Andre Naujoks wrote:
>> Hello.
>>
>> This patch removes the direct call to tty_wakeup in pty_write. I have
>> not noticed any drawbacks with this but I am not familiar with the pty
>> driver at all. I think what happens is a recursive loop,
>> write_wakeup->write->write_wakeup ...
>>
>> The documentation for the tty interface forbids this direct call:
>>
>> (from Documentation/serial/tty.txt)
>> write_wakeup() - May be called at any point between open and close.
>> The TTY_DO_WRITE_WAKEUP flag indicates if a call
>> is needed but always races versus calls. Thus the
>> ldisc must be careful about setting order and to
>> handle unexpected calls. Must not sleep.
>>
>> The driver is forbidden from calling this directly
>> from the ->write call from the ldisc as the ldisc
>> is permitted to call the driver write method from
>> this function. In such a situation defer it.
>>
>>
>>
>> The direct call caused a reproducable kernel panic (see bottom of this
>> mail) for me with the following setup:
>>
>> - using can-utils from git://gitorious.org/linux-can/can-utils.git
>> slcan_attach and cangen are used
>>
>> - create a network link between two serial CAN interfaces with:
>> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
>> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
>> $ slcan_attach /tmp/slcan0
>> $ slcan_attach /tmp/slcan1
>> $ ip link set slcan0 up
>> $ ip link set slcan1 up
>>
>> - produce a kernel panic by overloading the CAN interfaces:
>> $ cangen slcan0 -g0
>>
>>
>> Please keep me in CC. I am not subscribed to the list.
>> If I can provide any more information, I will be glad to do so.
>>
>> This is the patch. It applies to the current linux master branch:
>
> An identical patch is in Greg's queue for linux-next:
> 'tty: Remove extra wakeup from pty write() path'
>
> That patch's commit message details why tty_wakeup() is unnecessary,
> but does not foresee or document the SLIP ldisc write()/write_wakeup()
> recursion.
>
> Since this fix will now likely go back through stable, the commit
> message should include a description of the recursion, so that Greg can
> merge the commit messages.
>
> Separately, the stack trace for the WARN and the oops implicates
> the network stack alone. Maybe there is some other problem?
>
> Regards,
> Peter Hurley
>
>
>

2013-07-02 20:31:24

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write with better commit message

On 07/02/2013 04:21 PM, Andre Naujoks wrote:
> Thanks for the pointer. Since your patch is from April, does that mean, that we cannot expect it to hit 3.11?

Yes, that patch was scheduled for 3.12.
But a patch with a revised commit message detailing a bug fix
could definitely go into 3.11 (which is why I suggested adding
a detailed commit message explaining the bug fix).

Regards,
Peter Hurley

2013-07-02 20:32:20

by Andre Naujoks

[permalink] [raw]
Subject: Re: [PATCH] kernel panic, pty.c: remove direct call to tty_wakup in pty_write

On 02.07.2013 20:59, Peter Hurley wrote:
> On 07/01/2013 10:49 AM, Andre Naujoks wrote:
>> Hello.
>>
>> This patch removes the direct call to tty_wakeup in pty_write. I have
>> not noticed any drawbacks with this but I am not familiar with the pty
>> driver at all. I think what happens is a recursive loop,
>> write_wakeup->write->write_wakeup ...
>>
>> The documentation for the tty interface forbids this direct call:
>>
>> (from Documentation/serial/tty.txt)
>> write_wakeup() - May be called at any point between open and close.
>> The TTY_DO_WRITE_WAKEUP flag indicates if a call
>> is needed but always races versus calls. Thus the
>> ldisc must be careful about setting order and to
>> handle unexpected calls. Must not sleep.
>>
>> The driver is forbidden from calling this directly
>> from the ->write call from the ldisc as the ldisc
>> is permitted to call the driver write method from
>> this function. In such a situation defer it.
>>
>>
>>
>> The direct call caused a reproducable kernel panic (see bottom of this
>> mail) for me with the following setup:
>>
>> - using can-utils from git://gitorious.org/linux-can/can-utils.git
>> slcan_attach and cangen are used
>>
>> - create a network link between two serial CAN interfaces with:
>> $ socat PTY,link=/tmp/slcan0,raw TCP4-LISTEN:50000 &
>> $ socat TCP4:localhost:50000 PTY,link=/tmp/slcan1,raw &
>> $ slcan_attach /tmp/slcan0
>> $ slcan_attach /tmp/slcan1
>> $ ip link set slcan0 up
>> $ ip link set slcan1 up
>>
>> - produce a kernel panic by overloading the CAN interfaces:
>> $ cangen slcan0 -g0
>>
>>
>> Please keep me in CC. I am not subscribed to the list.
>> If I can provide any more information, I will be glad to do so.
>>
>> This is the patch. It applies to the current linux master branch:
>
> An identical patch is in Greg's queue for linux-next:
> 'tty: Remove extra wakeup from pty write() path'
>
> That patch's commit message details why tty_wakeup() is unnecessary,
> but does not foresee or document the SLIP ldisc write()/write_wakeup()
> recursion.
>
> Since this fix will now likely go back through stable, the commit
> message should include a description of the recursion, so that Greg can
> merge the commit messages.
>
> Separately, the stack trace for the WARN and the oops implicates
> the network stack alone. Maybe there is some other problem?

I haven't run into any other problems so far with the patch applied. If
there is another problem, I will probably run into it soon, because I
will use this facility very extensively in the near future. We'll see
how that works out.

Regards
Andre Naujoks

>
> Regards,
> Peter Hurley
>
>
>