The fundamental issue here is that gsmld_write() uses spin_lock_irqsave()
to ensure mutual exclusion. spin_lock_irqsave() disables IRQs,
essentially placing the thread of execution into atomic mode and
therefore must not sleep at any point. Unfortunately, the call chain
eventually ends up in __might_resched() which complains loudly.
The BUG() splat looks like this:
BUG: sleeping function called from invalid context at kernel/printk/printk.c:2627
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 13029, name: (agetty)
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
3 locks held by (agetty)/13029:
#0: ffff888112fc70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x21/0x70
#1: ffff888112fc7130 (&tty->atomic_write_lock){+.+.}-{3:3}, at: file_tty_write+0x1e4/0x9d0
#2: ffff8881053c73e0 (&gsm->tx_lock){....}-{2:2}, at: gsmld_write+0x5b/0x120
irq event stamp: 2506
hardirqs last enabled at (2505): [<ffffffff8b007b0e>] syscall_enter_from_user_mode+0x2e/0x1a0
hardirqs last disabled at (2506): [<ffffffff8b0ab5cc>] _raw_spin_lock_irqsave+0xac/0x120
softirqs last enabled at (514): [<ffffffff81561c9c>] __irq_exit_rcu+0xec/0x160
softirqs last disabled at (509): [<ffffffff81561c9c>] __irq_exit_rcu+0xec/0x160
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 1 PID: 13029 Comm: (agetty) Not tainted 6.6.0-rc3-next-20230929-00002-gcdf323998d5b #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x1e3/0x2d0
? nf_tcp_handle_invalid+0x650/0x650
? panic+0x8a0/0x8a0
__might_resched+0x5ce/0x790
? __might_sleep+0xc0/0xc0
? reacquire_held_locks+0x680/0x680
console_lock+0x1c/0x1b0
do_con_write+0x118/0x7410
? stack_trace_snprint+0xf0/0xf0
? lockdep_unlock+0x163/0x300
? lockdep_unlock+0x163/0x300
? mark_lock+0x2a0/0x350
? __lock_acquire+0x1302/0x2050
? vt_resize+0xc0/0xc0
? do_raw_spin_lock+0x147/0x3a0
? __rwlock_init+0x140/0x140
? _raw_spin_lock_irqsave+0xdd/0x120
? _raw_spin_lock+0x40/0x40
con_write+0x29/0x640
? check_heap_object+0x249/0x870
gsmld_write+0xf9/0x120
? gsmld_read+0x10/0x10
file_tty_write+0x57c/0x9d0
vfs_write+0x77d/0xaf0
? file_end_write+0x250/0x250
? tty_ioctl+0x8fa/0xbc0
? __fdget_pos+0x1d2/0x330
ksys_write+0x19b/0x2c0
? print_irqtrace_events+0x220/0x220
? __ia32_sys_read+0x80/0x80
? syscall_enter_from_user_mode+0x2e/0x1a0
? syscall_enter_from_user_mode+0x2e/0x1a0
do_syscall_64+0x2b/0x70
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f76546101b0
The important part of the call stack being:
gsmld_write() # Takes a lock and disables IRQs
con_write()
console_lock()
__might_sleep()
__might_resched() # Complains that IRQs are disabled
To fix this, let's ensure mutual exclusion by using a protected shared
variable (busy) instead. We'll use the current locking mechanism to
protect it, but ensure that the locks are released and IRQs re-enabled
by the time we transit further down the call chain which may sleep.
Cc: Daniel Starke <[email protected]>
Cc: Fedor Pchelkin <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Lee Jones <[email protected]>
---
drivers/tty/n_gsm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
index 1f3aba607cd51..b83a97d58381f 100644
--- a/drivers/tty/n_gsm.c
+++ b/drivers/tty/n_gsm.c
@@ -270,6 +270,7 @@ struct gsm_mux {
struct tty_struct *tty; /* The tty our ldisc is bound to */
spinlock_t lock;
struct mutex mutex;
+ bool busy;
unsigned int num;
struct kref ref;
@@ -3253,6 +3254,7 @@ static struct gsm_mux *gsm_alloc_mux(void)
gsm->dead = true; /* Avoid early tty opens */
gsm->wait_config = false; /* Disabled */
gsm->keep_alive = 0; /* Disabled */
+ gsm->busy = false;
/* Store the instance to the mux array or abort if no space is
* available.
@@ -3718,11 +3720,21 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
ret = -ENOBUFS;
spin_lock_irqsave(&gsm->tx_lock, flags);
+ if (gsm->busy) {
+ spin_unlock_irqrestore(&gsm->tx_lock, flags);
+ return -EBUSY;
+ }
+ gsm->busy = true;
+ spin_unlock_irqrestore(&gsm->tx_lock, flags);
+
space = tty_write_room(tty);
if (space >= nr)
ret = tty->ops->write(tty, buf, nr);
else
set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);
+
+ spin_lock_irqsave(&gsm->tx_lock, flags);
+ gsm->busy = false;
spin_unlock_irqrestore(&gsm->tx_lock, flags);
return ret;
--
2.42.0.582.g8ccd20d70d-goog
On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> The important part of the call stack being:
>
> gsmld_write() # Takes a lock and disables IRQs
> con_write()
> console_lock()
Wait, why is the n_gsm line discipline being used for a console?
What hardware/protocol wants this to happen?
gsm I thought was for a very specific type of device, not a console.
As per:
https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
this is a specific modem protocol, why is con_write() being called?
> __might_sleep()
> __might_resched() # Complains that IRQs are disabled
>
> To fix this, let's ensure mutual exclusion by using a protected shared
> variable (busy) instead. We'll use the current locking mechanism to
> protect it, but ensure that the locks are released and IRQs re-enabled
> by the time we transit further down the call chain which may sleep.
>
> Cc: Daniel Starke <[email protected]>
> Cc: Fedor Pchelkin <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: [email protected]
> Reported-by: [email protected]
> Signed-off-by: Lee Jones <[email protected]>
> ---
> drivers/tty/n_gsm.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
> index 1f3aba607cd51..b83a97d58381f 100644
> --- a/drivers/tty/n_gsm.c
> +++ b/drivers/tty/n_gsm.c
> @@ -270,6 +270,7 @@ struct gsm_mux {
> struct tty_struct *tty; /* The tty our ldisc is bound to */
> spinlock_t lock;
> struct mutex mutex;
> + bool busy;
> unsigned int num;
> struct kref ref;
>
> @@ -3253,6 +3254,7 @@ static struct gsm_mux *gsm_alloc_mux(void)
> gsm->dead = true; /* Avoid early tty opens */
> gsm->wait_config = false; /* Disabled */
> gsm->keep_alive = 0; /* Disabled */
> + gsm->busy = false;
>
> /* Store the instance to the mux array or abort if no space is
> * available.
> @@ -3718,11 +3720,21 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
>
> ret = -ENOBUFS;
> spin_lock_irqsave(&gsm->tx_lock, flags);
> + if (gsm->busy) {
> + spin_unlock_irqrestore(&gsm->tx_lock, flags);
> + return -EBUSY;
So you just "busted" the re-entrant call chain here, are you sure this
is ok for this protocl? Can it handle -EBUSY?
Daniel, any thoughts?
And Lee, you really don't have this hardware, right? So why are you
dealing with bug reports for it? :)
thanks,
greg k-h
On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > The important part of the call stack being:
> >
> > gsmld_write() # Takes a lock and disables IRQs
> > con_write()
> > console_lock()
>
> Wait, why is the n_gsm line discipline being used for a console?
>
> What hardware/protocol wants this to happen?
>
> gsm I thought was for a very specific type of device, not a console.
>
> As per:
> https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> this is a specific modem protocol, why is con_write() being called?
What it's meant for and what random users can make it do are likely to
be quite separate questions. This scenario is user driven and can be
replicated simply by issuing a few syscalls (open, ioctl, write).
> > __might_sleep()
> > __might_resched() # Complains that IRQs are disabled
> >
> > To fix this, let's ensure mutual exclusion by using a protected shared
> > variable (busy) instead. We'll use the current locking mechanism to
> > protect it, but ensure that the locks are released and IRQs re-enabled
> > by the time we transit further down the call chain which may sleep.
> >
> > Cc: Daniel Starke <[email protected]>
> > Cc: Fedor Pchelkin <[email protected]>
> > Cc: Jiri Slaby <[email protected]>
> > Cc: Greg Kroah-Hartman <[email protected]>
> > Cc: [email protected]
> > Reported-by: [email protected]
> > Signed-off-by: Lee Jones <[email protected]>
> > ---
> > drivers/tty/n_gsm.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
> > index 1f3aba607cd51..b83a97d58381f 100644
> > --- a/drivers/tty/n_gsm.c
> > +++ b/drivers/tty/n_gsm.c
> > @@ -270,6 +270,7 @@ struct gsm_mux {
> > struct tty_struct *tty; /* The tty our ldisc is bound to */
> > spinlock_t lock;
> > struct mutex mutex;
> > + bool busy;
> > unsigned int num;
> > struct kref ref;
> >
> > @@ -3253,6 +3254,7 @@ static struct gsm_mux *gsm_alloc_mux(void)
> > gsm->dead = true; /* Avoid early tty opens */
> > gsm->wait_config = false; /* Disabled */
> > gsm->keep_alive = 0; /* Disabled */
> > + gsm->busy = false;
> >
> > /* Store the instance to the mux array or abort if no space is
> > * available.
> > @@ -3718,11 +3720,21 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> >
> > ret = -ENOBUFS;
> > spin_lock_irqsave(&gsm->tx_lock, flags);
> > + if (gsm->busy) {
> > + spin_unlock_irqrestore(&gsm->tx_lock, flags);
> > + return -EBUSY;
>
> So you just "busted" the re-entrant call chain here, are you sure this
> is ok for this protocl? Can it handle -EBUSY?
I should have marked this submission as RFC. Mea culpa. Please
consider it as such going forward. Feedback like this is highly
valuable.
> Daniel, any thoughts?
>
> And Lee, you really don't have this hardware, right? So why are you
> dealing with bug reports for it? :)
'cos Syzkaller.
And no, as per the splat above, this was reproduced in qemu.
--
Lee Jones [李琼斯]
> Daniel, any thoughts?
Our application of this protocol is only with specific modems to enable
circuit switched operation (handling calls, selecting/querying networks,
etc.) while doing packet switched communication (i.e. IP traffic over PPP).
The protocol was developed for such use cases.
Regarding the issue itself:
There was already an attempt to fix all this by switching from spinlocks to
mutexes resulting in ~20% performance loss. However, the patch was reverted
as it did not handle the T1 timer leading into sleep during atomic within
gsm_dlci_t1() on every mutex lock there.
There was also a suggestion to fix this in do_con_write() as
tty_operations::write() appears to be documented as "not allowed to sleep".
The patch for this was rejected. It did not fix the issue within n_gsm.
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/[email protected]/
Best regards,
Daniel Starke
On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > The important part of the call stack being:
> > >
> > > gsmld_write() # Takes a lock and disables IRQs
> > > con_write()
> > > console_lock()
> >
> > Wait, why is the n_gsm line discipline being used for a console?
> >
> > What hardware/protocol wants this to happen?
> >
> > gsm I thought was for a very specific type of device, not a console.
> >
> > As per:
> > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > this is a specific modem protocol, why is con_write() being called?
>
> What it's meant for and what random users can make it do are likely to
> be quite separate questions. This scenario is user driven and can be
> replicated simply by issuing a few syscalls (open, ioctl, write).
I would recommend that any distro/system that does not want to support
this specific hardware protocol, just disable it for now (it's marked as
experimental too), if they don't want to deal with the potential
sleep-while-atomic issue.
> > And Lee, you really don't have this hardware, right? So why are you
> > dealing with bug reports for it? :)
>
> 'cos Syzkaller.
Ah, yeah, fake report, no real issue here then :)
thanks,
greg k-h
On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > Daniel, any thoughts?
>
> Our application of this protocol is only with specific modems to enable
> circuit switched operation (handling calls, selecting/querying networks,
> etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> The protocol was developed for such use cases.
>
> Regarding the issue itself:
> There was already an attempt to fix all this by switching from spinlocks to
> mutexes resulting in ~20% performance loss. However, the patch was reverted
> as it did not handle the T1 timer leading into sleep during atomic within
> gsm_dlci_t1() on every mutex lock there.
> There was also a suggestion to fix this in do_con_write() as
> tty_operations::write() appears to be documented as "not allowed to sleep".
> The patch for this was rejected. It did not fix the issue within n_gsm.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Link: https://lore.kernel.org/all/[email protected]/
> Link: https://lore.kernel.org/all/[email protected]/
Ok, I thought I remembered this, I'll just drop this patch from my
review queue and wait for a better solution if it ever comes up as this
isn't a real issue that people are seeing on actual systems, but just a
syzbot report.
thanks,
greg k-h
On 04. 10. 23, 8:05, Greg Kroah-Hartman wrote:
> On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
>>> Daniel, any thoughts?
>>
>> Our application of this protocol is only with specific modems to enable
>> circuit switched operation (handling calls, selecting/querying networks,
>> etc.) while doing packet switched communication (i.e. IP traffic over PPP).
>> The protocol was developed for such use cases.
>>
>> Regarding the issue itself:
>> There was already an attempt to fix all this by switching from spinlocks to
>> mutexes resulting in ~20% performance loss. However, the patch was reverted
>> as it did not handle the T1 timer leading into sleep during atomic within
>> gsm_dlci_t1() on every mutex lock there.
>> There was also a suggestion to fix this in do_con_write() as
>> tty_operations::write() appears to be documented as "not allowed to sleep".
>> The patch for this was rejected. It did not fix the issue within n_gsm.
>>
>> Link: https://lore.kernel.org/all/[email protected]/
>> Link: https://lore.kernel.org/all/[email protected]/
>> Link: https://lore.kernel.org/all/[email protected]/
>
> Ok, I thought I remembered this, I'll just drop this patch from my
> review queue and wait for a better solution if it ever comes up as this
> isn't a real issue that people are seeing on actual systems, but just a
> syzbot report.
I remember too and it is not easy.
So without actually looking into the code, can we just somehow disallow
attaching this line discipline to a console (ban such a TIOCSETD) and be
done with this issue? IOW disallow root to shoot their foot.
regards,
--
js
suse labs
On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > Daniel, any thoughts?
> >
> > Our application of this protocol is only with specific modems to enable
> > circuit switched operation (handling calls, selecting/querying networks,
> > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > The protocol was developed for such use cases.
> >
> > Regarding the issue itself:
> > There was already an attempt to fix all this by switching from spinlocks to
> > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > as it did not handle the T1 timer leading into sleep during atomic within
> > gsm_dlci_t1() on every mutex lock there.
That's correct. When I initially saw this report, my initial thought
was to replace the spinlocks with mutexts, but having read the previous
accepted attempt and it's subsequent reversion I started to think of
other ways to solve this issue. This solution, unlike the last, does
not involve adding sleep inducing locks into atomic contexts, nor
should it negatively affect performance.
> > There was also a suggestion to fix this in do_con_write() as
> > tty_operations::write() appears to be documented as "not allowed to sleep".
> > The patch for this was rejected. It did not fix the issue within n_gsm.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Link: https://lore.kernel.org/all/[email protected]/
> > Link: https://lore.kernel.org/all/[email protected]/
>
> Ok, I thought I remembered this, I'll just drop this patch from my
> review queue and wait for a better solution if it ever comes up as this
> isn't a real issue that people are seeing on actual systems, but just a
> syzbot report.
What does the "better solution" look like?
--
Lee Jones [李琼斯]
> So without actually looking into the code, can we just somehow disallow
> attaching this line discipline to a console (ban such a TIOCSETD) and be
> done with this issue? IOW disallow root to shoot their foot.
Probably possible by patching gsmld_open() if there is a function which can
check this based on a struct tty_struct * handle. That may, however,
introduce some cross references that are hard to maintain.
Best regards,
Daniel Starke
On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> > On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> >
> > > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > > The important part of the call stack being:
> > > >
> > > > gsmld_write() # Takes a lock and disables IRQs
> > > > con_write()
> > > > console_lock()
> > >
> > > Wait, why is the n_gsm line discipline being used for a console?
> > >
> > > What hardware/protocol wants this to happen?
> > >
> > > gsm I thought was for a very specific type of device, not a console.
> > >
> > > As per:
> > > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > > this is a specific modem protocol, why is con_write() being called?
> >
> > What it's meant for and what random users can make it do are likely to
> > be quite separate questions. This scenario is user driven and can be
> > replicated simply by issuing a few syscalls (open, ioctl, write).
>
> I would recommend that any distro/system that does not want to support
> this specific hardware protocol, just disable it for now (it's marked as
> experimental too), if they don't want to deal with the potential
> sleep-while-atomic issue.
n_gsm is available on all the systems I have available. The mention of
'EXPERIMENTAL' in the module description appears to have zero effect on
whether distros choose to make it available or not. If you're saying
that we know this module is BROKEN however, then perhaps we should mark
it as such.
> > > And Lee, you really don't have this hardware, right? So why are you
> > > dealing with bug reports for it? :)
> >
> > 'cos Syzkaller.
>
> Ah, yeah, fake report, no real issue here then :)
Ouch! The way I see it, the present issue with Syzkaller is that we do
not have the resources to remedy all of the issues it flags. Passing
off all reports as 'not real issues' is going to make engineers who
decide to work on them feel undervalued and is likely have a detrimental
effect overall. As an ambassador for young and new people trying to get
into Kernel Engineering in general, is that really the outcome you're
after?
--
Lee Jones [李琼斯]
On Wed, Oct 04, 2023 at 10:09:18AM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> > > On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > > > The important part of the call stack being:
> > > > >
> > > > > gsmld_write() # Takes a lock and disables IRQs
> > > > > con_write()
> > > > > console_lock()
> > > >
> > > > Wait, why is the n_gsm line discipline being used for a console?
> > > >
> > > > What hardware/protocol wants this to happen?
> > > >
> > > > gsm I thought was for a very specific type of device, not a console.
> > > >
> > > > As per:
> > > > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > > > this is a specific modem protocol, why is con_write() being called?
> > >
> > > What it's meant for and what random users can make it do are likely to
> > > be quite separate questions. This scenario is user driven and can be
> > > replicated simply by issuing a few syscalls (open, ioctl, write).
> >
> > I would recommend that any distro/system that does not want to support
> > this specific hardware protocol, just disable it for now (it's marked as
> > experimental too), if they don't want to deal with the potential
> > sleep-while-atomic issue.
>
> n_gsm is available on all the systems I have available.
Then file a bug with your distro to disable it? No real general purpose
distro should enable it from what I can tell.
> The mention of
> 'EXPERIMENTAL' in the module description appears to have zero effect on
> whether distros choose to make it available or not. If you're saying
> that we know this module is BROKEN however, then perhaps we should mark
> it as such.
Or we just prevent it from being bound to a console as that's not
something that should be happening.
And again, the "worst" that can happen is the calling process locks up,
due to a lock inversion, right?
> > > > And Lee, you really don't have this hardware, right? So why are you
> > > > dealing with bug reports for it? :)
> > >
> > > 'cos Syzkaller.
> >
> > Ah, yeah, fake report, no real issue here then :)
>
> Ouch! The way I see it, the present issue with Syzkaller is that we do
> not have the resources to remedy all of the issues it flags. Passing
> off all reports as 'not real issues' is going to make engineers who
> decide to work on them feel undervalued and is likely have a detrimental
> effect overall. As an ambassador for young and new people trying to get
> into Kernel Engineering in general, is that really the outcome you're
> after?
That's not what I'm saying here at all, what I'm saying is "pick issues
that are real". syzbot does not always make it obvious what is, and is
not, a real issue. There have been long threads and discussions about
this and some developers are now just ignoring all syzbot reports (see
the filesystem thread on the ksummit discuss list for more details.)
For this specific issue, it's been much-reported, and is not trivial,
and I would argue, not a "real" problem in the grand scheme of things
for normal users to worry about.
thanks,
greg k-h
On Wed, Oct 04, 2023 at 10:09:18AM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> > > On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > > > The important part of the call stack being:
> > > > >
> > > > > gsmld_write() # Takes a lock and disables IRQs
> > > > > con_write()
> > > > > console_lock()
> > > >
> > > > Wait, why is the n_gsm line discipline being used for a console?
> > > >
> > > > What hardware/protocol wants this to happen?
> > > >
> > > > gsm I thought was for a very specific type of device, not a console.
> > > >
> > > > As per:
> > > > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > > > this is a specific modem protocol, why is con_write() being called?
> > >
> > > What it's meant for and what random users can make it do are likely to
> > > be quite separate questions. This scenario is user driven and can be
> > > replicated simply by issuing a few syscalls (open, ioctl, write).
> >
> > I would recommend that any distro/system that does not want to support
> > this specific hardware protocol, just disable it for now (it's marked as
> > experimental too), if they don't want to deal with the potential
> > sleep-while-atomic issue.
>
> n_gsm is available on all the systems I have available. The mention of
> 'EXPERIMENTAL' in the module description appears to have zero effect on
> whether distros choose to make it available or not. If you're saying
> that we know this module is BROKEN however, then perhaps we should mark
> it as such.
Also, I think this requires root to set this line discipline to the
console, right? A normal user can't do that, or am I missing a code
path here?
Is there a reproducer somewhere for this issue that runs as a normal
user? I couldn't find one in the syzbot listings but I might have been
not looking deep enough.
thanks,
greg k-h
On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > Daniel, any thoughts?
> > >
> > > Our application of this protocol is only with specific modems to enable
> > > circuit switched operation (handling calls, selecting/querying networks,
> > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > The protocol was developed for such use cases.
> > >
> > > Regarding the issue itself:
> > > There was already an attempt to fix all this by switching from spinlocks to
> > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > as it did not handle the T1 timer leading into sleep during atomic within
> > > gsm_dlci_t1() on every mutex lock there.
>
> That's correct. When I initially saw this report, my initial thought
> was to replace the spinlocks with mutexts, but having read the previous
> accepted attempt and it's subsequent reversion I started to think of
> other ways to solve this issue. This solution, unlike the last, does
> not involve adding sleep inducing locks into atomic contexts, nor
> should it negatively affect performance.
>
> > > There was also a suggestion to fix this in do_con_write() as
> > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > >
> > > Link: https://lore.kernel.org/all/[email protected]/
> > > Link: https://lore.kernel.org/all/[email protected]/
> > > Link: https://lore.kernel.org/all/[email protected]/
> >
> > Ok, I thought I remembered this, I'll just drop this patch from my
> > review queue and wait for a better solution if it ever comes up as this
> > isn't a real issue that people are seeing on actual systems, but just a
> > syzbot report.
>
> What does the "better solution" look like?
One that actually fixes the root problem here (i.e. does not break the
recursion loop, or cause a performance decrease for normal users, or
prevent this from being bound to the console).
thanks,
greg k-h
On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> >
> > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > Daniel, any thoughts?
> > > >
> > > > Our application of this protocol is only with specific modems to enable
> > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > The protocol was developed for such use cases.
> > > >
> > > > Regarding the issue itself:
> > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > gsm_dlci_t1() on every mutex lock there.
> >
> > That's correct. When I initially saw this report, my initial thought
> > was to replace the spinlocks with mutexts, but having read the previous
> > accepted attempt and it's subsequent reversion I started to think of
> > other ways to solve this issue. This solution, unlike the last, does
> > not involve adding sleep inducing locks into atomic contexts, nor
> > should it negatively affect performance.
> >
> > > > There was also a suggestion to fix this in do_con_write() as
> > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > >
> > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > Link: https://lore.kernel.org/all/[email protected]/
> > >
> > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > review queue and wait for a better solution if it ever comes up as this
> > > isn't a real issue that people are seeing on actual systems, but just a
> > > syzbot report.
> >
> > What does the "better solution" look like?
>
> One that actually fixes the root problem here (i.e. does not break the
> recursion loop, or cause a performance decrease for normal users, or
> prevent this from being bound to the console).
Does this solution break the recursion loop or affect performance?
The last suggestion was recently made (after mine was posted).
--
Lee Jones [李琼斯]
On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> On Wed, Oct 04, 2023 at 10:09:18AM +0100, Lee Jones wrote:
> > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> >
> > > On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> > > > On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> > > >
> > > > > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > > > > The important part of the call stack being:
> > > > > >
> > > > > > gsmld_write() # Takes a lock and disables IRQs
> > > > > > con_write()
> > > > > > console_lock()
> > > > >
> > > > > Wait, why is the n_gsm line discipline being used for a console?
> > > > >
> > > > > What hardware/protocol wants this to happen?
> > > > >
> > > > > gsm I thought was for a very specific type of device, not a console.
> > > > >
> > > > > As per:
> > > > > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > > > > this is a specific modem protocol, why is con_write() being called?
> > > >
> > > > What it's meant for and what random users can make it do are likely to
> > > > be quite separate questions. This scenario is user driven and can be
> > > > replicated simply by issuing a few syscalls (open, ioctl, write).
> > >
> > > I would recommend that any distro/system that does not want to support
> > > this specific hardware protocol, just disable it for now (it's marked as
> > > experimental too), if they don't want to deal with the potential
> > > sleep-while-atomic issue.
> >
> > n_gsm is available on all the systems I have available. The mention of
> > 'EXPERIMENTAL' in the module description appears to have zero effect on
> > whether distros choose to make it available or not. If you're saying
> > that we know this module is BROKEN however, then perhaps we should mark
> > it as such.
>
> Also, I think this requires root to set this line discipline to the
> console, right? A normal user can't do that, or am I missing a code
> path here?
I haven't been testing long, but yes, early indications show that root
is required.
> Is there a reproducer somewhere for this issue that runs as a normal
> user? I couldn't find one in the syzbot listings but I might have been
> not looking deep enough.
https://syzkaller.appspot.com/text?tag=ReproC&x=15578d8fa80000
--
Lee Jones [李琼斯]
On Wed, Oct 04, 2023 at 01:57:58PM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Wed, Oct 04, 2023 at 10:09:18AM +0100, Lee Jones wrote:
> > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Tue, Oct 03, 2023 at 07:55:00PM +0100, Lee Jones wrote:
> > > > > On Tue, 03 Oct 2023, Greg Kroah-Hartman wrote:
> > > > >
> > > > > > On Tue, Oct 03, 2023 at 06:00:20PM +0100, Lee Jones wrote:
> > > > > > > The important part of the call stack being:
> > > > > > >
> > > > > > > gsmld_write() # Takes a lock and disables IRQs
> > > > > > > con_write()
> > > > > > > console_lock()
> > > > > >
> > > > > > Wait, why is the n_gsm line discipline being used for a console?
> > > > > >
> > > > > > What hardware/protocol wants this to happen?
> > > > > >
> > > > > > gsm I thought was for a very specific type of device, not a console.
> > > > > >
> > > > > > As per:
> > > > > > https://www.kernel.org/doc/html/v5.9/driver-api/serial/n_gsm.html
> > > > > > this is a specific modem protocol, why is con_write() being called?
> > > > >
> > > > > What it's meant for and what random users can make it do are likely to
> > > > > be quite separate questions. This scenario is user driven and can be
> > > > > replicated simply by issuing a few syscalls (open, ioctl, write).
> > > >
> > > > I would recommend that any distro/system that does not want to support
> > > > this specific hardware protocol, just disable it for now (it's marked as
> > > > experimental too), if they don't want to deal with the potential
> > > > sleep-while-atomic issue.
> > >
> > > n_gsm is available on all the systems I have available. The mention of
> > > 'EXPERIMENTAL' in the module description appears to have zero effect on
> > > whether distros choose to make it available or not. If you're saying
> > > that we know this module is BROKEN however, then perhaps we should mark
> > > it as such.
> >
> > Also, I think this requires root to set this line discipline to the
> > console, right? A normal user can't do that, or am I missing a code
> > path here?
>
> I haven't been testing long, but yes, early indications show that root
> is required.
Oh good, then this really isn't that high of a priority to get fixed as
root can do much worse things :)
thanks,
greg k-h
On Wed, Oct 04, 2023 at 01:57:04PM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > > Daniel, any thoughts?
> > > > >
> > > > > Our application of this protocol is only with specific modems to enable
> > > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > > The protocol was developed for such use cases.
> > > > >
> > > > > Regarding the issue itself:
> > > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > > gsm_dlci_t1() on every mutex lock there.
> > >
> > > That's correct. When I initially saw this report, my initial thought
> > > was to replace the spinlocks with mutexts, but having read the previous
> > > accepted attempt and it's subsequent reversion I started to think of
> > > other ways to solve this issue. This solution, unlike the last, does
> > > not involve adding sleep inducing locks into atomic contexts, nor
> > > should it negatively affect performance.
> > >
> > > > > There was also a suggestion to fix this in do_con_write() as
> > > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > > >
> > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > >
> > > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > > review queue and wait for a better solution if it ever comes up as this
> > > > isn't a real issue that people are seeing on actual systems, but just a
> > > > syzbot report.
> > >
> > > What does the "better solution" look like?
> >
> > One that actually fixes the root problem here (i.e. does not break the
> > recursion loop, or cause a performance decrease for normal users, or
> > prevent this from being bound to the console).
>
> Does this solution break the recursion loop or affect performance?
This solution broke the recursion by returning an error, right?
The performance one was by using mutexes as in previous attempts.
thanks,
greg k-h
On 04. 10. 23, 14:57, Lee Jones wrote:
>>> n_gsm is available on all the systems I have available. The mention of
>>> 'EXPERIMENTAL' in the module description appears to have zero effect on
>>> whether distros choose to make it available or not. If you're saying
>>> that we know this module is BROKEN however, then perhaps we should mark
>>> it as such.
>>
>> Also, I think this requires root to set this line discipline to the
>> console, right? A normal user can't do that, or am I missing a code
>> path here?
>
> I haven't been testing long, but yes, early indications show that root
> is required.
FWIW I concluded to the same yesterday, so I disputed the connected
CVE-2023-31082. Waiting for mitre's ack/nack.
thanks,
--
js
suse labs
> > Would something like this tick that box?
> >
> > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index
> > 1f3aba607cd51..5c1d2fcd5d9e2 100644
> > --- a/drivers/tty/n_gsm.c
> > +++ b/drivers/tty/n_gsm.c
> > @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> > if (!gsm)
> > return -ENODEV;
> >
> > + /* The GSM line discipline does not support binding to console */
> > + if (strncmp(tty->name, "tty", 3))
> > + return -EINVAL;
>
> No, that's not going to work, some consoles do not start with "tty" :(
Also, I would recommend exiting as early as possible. E.g. in gsmld_open().
And please retain support for real serial devices (e.g. ttyS, ttyUSB,
ttyACM, ...).
Best regards,
Daniel Starke
On Thu, Oct 05, 2023 at 11:43:11AM +0100, Lee Jones wrote:
> On Thu, 05 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Thu, Oct 05, 2023 at 10:03:11AM +0100, Lee Jones wrote:
> > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Wed, Oct 04, 2023 at 01:57:04PM +0100, Lee Jones wrote:
> > > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > > >
> > > > > > On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > > > > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > > > > >
> > > > > > > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > > > > > > Daniel, any thoughts?
> > > > > > > > >
> > > > > > > > > Our application of this protocol is only with specific modems to enable
> > > > > > > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > > > > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > > > > > > The protocol was developed for such use cases.
> > > > > > > > >
> > > > > > > > > Regarding the issue itself:
> > > > > > > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > > > > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > > > > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > > > > > > gsm_dlci_t1() on every mutex lock there.
> > > > > > >
> > > > > > > That's correct. When I initially saw this report, my initial thought
> > > > > > > was to replace the spinlocks with mutexts, but having read the previous
> > > > > > > accepted attempt and it's subsequent reversion I started to think of
> > > > > > > other ways to solve this issue. This solution, unlike the last, does
> > > > > > > not involve adding sleep inducing locks into atomic contexts, nor
> > > > > > > should it negatively affect performance.
> > > > > > >
> > > > > > > > > There was also a suggestion to fix this in do_con_write() as
> > > > > > > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > > > > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > > > > > > >
> > > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > >
> > > > > > > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > > > > > > review queue and wait for a better solution if it ever comes up as this
> > > > > > > > isn't a real issue that people are seeing on actual systems, but just a
> > > > > > > > syzbot report.
> > > > > > >
> > > > > > > What does the "better solution" look like?
> > > > > >
> > > > > > One that actually fixes the root problem here (i.e. does not break the
> > > > > > recursion loop, or cause a performance decrease for normal users, or
> > > > > > prevent this from being bound to the console).
> > > > >
> > > > > Does this solution break the recursion loop or affect performance?
> > > >
> > > > This solution broke the recursion by returning an error, right?
> > >
> > > This is the part I was least sure about.
> > >
> > > If this was considered valid and we were to go forward with a solution
> > > like this, what would a quality improvement look like? Should we have
> > > stayed in this function and waited for the previous occupant to leave
> > > before continuing through ->write()?
> >
> > This isn't valid, as it obviously never shows up in real use.
> >
> > The real solution should be to prevent binding a console to this line
> > discipline as it can not handle the recursion that consoles require for
> > the write path.
>
> Would something like this tick that box?
>
> diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
> index 1f3aba607cd51..5c1d2fcd5d9e2 100644
> --- a/drivers/tty/n_gsm.c
> +++ b/drivers/tty/n_gsm.c
> @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> if (!gsm)
> return -ENODEV;
>
> + /* The GSM line discipline does not support binding to console */
> + if (strncmp(tty->name, "tty", 3))
> + return -EINVAL;
No, that's not going to work, some consoles do not start with "tty" :(
thanks,
greg k-h
On Thu, Oct 05, 2023 at 01:17:56PM +0100, Lee Jones wrote:
> On Thu, 05 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Thu, Oct 05, 2023 at 12:46:32PM +0100, Lee Jones wrote:
> > > On Thu, 05 Oct 2023, Starke, Daniel wrote:
> > >
> > > > > > Would something like this tick that box?
> > > > > >
> > > > > > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index
> > > > > > 1f3aba607cd51..5c1d2fcd5d9e2 100644
> > > > > > --- a/drivers/tty/n_gsm.c
> > > > > > +++ b/drivers/tty/n_gsm.c
> > > > > > @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> > > > > > if (!gsm)
> > > > > > return -ENODEV;
> > > > > >
> > > > > > + /* The GSM line discipline does not support binding to console */
> > > > > > + if (strncmp(tty->name, "tty", 3))
> > > > > > + return -EINVAL;
> > > > >
> > > > > No, that's not going to work, some consoles do not start with "tty" :(
> > >
> > > Ah, you mean there are others that we need to consider?
> > >
> > > I was just covering off con_write() from drivers/tty/vt/vt.c.
> > >
> > > Does anyone have a counter proposal?
> >
> > consoles are dynamically assigned, the device knows if it is a console
> > or not, so there is a way to determine this at runtime. It's not a
> > device naming thing at all.
>
> It's not a device naming thing, but it is a ... :)
>
> Okay, here's another uninformed stab in the dark:
>
> diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
> index 1f3aba607cd51..ddf00f6a4141d 100644
> --- a/drivers/tty/n_gsm.c
> +++ b/drivers/tty/n_gsm.c
> @@ -3629,6 +3629,10 @@ static int gsmld_open(struct tty_struct *tty)
> if (tty->ops->write == NULL)
> return -EINVAL;
>
> + /* The GSM line discipline does not support binding to console */
> + if (tty->driver->type == TTY_DRIVER_TYPE_CONSOLE)
> + return -EINVAL;
> +
> /* Attach our ldisc data */
> gsm = gsm_alloc_mux();
> if (gsm == NULL)
>
> This seems to allow for the real serial devices (TTY_DRIVER_TYPE_SERIAL)
> suggested by Daniel.
Close, but not quite.
Driver "types" can say if they are a console or not, but that doesn't
mean you can't run the console over a serial port as well, right?
Sorry, I don't have a real solution right now, and wouldn't wish the
phrase "just dig through the tty console code!" on anyone, but that's
what it is going to take to figure it out somehow, I can't remember the
exact way consoles are hooked up at the moment (I conviently skipped
that portion of the tty layer in my Embedded Recipies talk last week,
saying "it's magic...")
thanks,
greg k-h
On Thu, 05 Oct 2023, Starke, Daniel wrote:
> > > Would something like this tick that box?
> > >
> > > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index
> > > 1f3aba607cd51..5c1d2fcd5d9e2 100644
> > > --- a/drivers/tty/n_gsm.c
> > > +++ b/drivers/tty/n_gsm.c
> > > @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> > > if (!gsm)
> > > return -ENODEV;
> > >
> > > + /* The GSM line discipline does not support binding to console */
> > > + if (strncmp(tty->name, "tty", 3))
> > > + return -EINVAL;
> >
> > No, that's not going to work, some consoles do not start with "tty" :(
Ah, you mean there are others that we need to consider?
I was just covering off con_write() from drivers/tty/vt/vt.c.
Does anyone have a counter proposal?
> Also, I would recommend exiting as early as possible. E.g. in gsmld_open().
Good suggestion.
> And please retain support for real serial devices (e.g. ttyS, ttyUSB,
> ttyACM, ...).
Okay, so it's "tty{0-9}*$" that should be excluded?
Plus others that Greg alluded to?
Is there a definitive accept / reject list?
--
Lee Jones [李琼斯]
On Thu, 05 Oct 2023, Greg Kroah-Hartman wrote:
> On Thu, Oct 05, 2023 at 10:03:11AM +0100, Lee Jones wrote:
> > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> >
> > > On Wed, Oct 04, 2023 at 01:57:04PM +0100, Lee Jones wrote:
> > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > >
> > > > > On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > > > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > > > >
> > > > > > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > > > > > Daniel, any thoughts?
> > > > > > > >
> > > > > > > > Our application of this protocol is only with specific modems to enable
> > > > > > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > > > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > > > > > The protocol was developed for such use cases.
> > > > > > > >
> > > > > > > > Regarding the issue itself:
> > > > > > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > > > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > > > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > > > > > gsm_dlci_t1() on every mutex lock there.
> > > > > >
> > > > > > That's correct. When I initially saw this report, my initial thought
> > > > > > was to replace the spinlocks with mutexts, but having read the previous
> > > > > > accepted attempt and it's subsequent reversion I started to think of
> > > > > > other ways to solve this issue. This solution, unlike the last, does
> > > > > > not involve adding sleep inducing locks into atomic contexts, nor
> > > > > > should it negatively affect performance.
> > > > > >
> > > > > > > > There was also a suggestion to fix this in do_con_write() as
> > > > > > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > > > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > > > > > >
> > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > >
> > > > > > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > > > > > review queue and wait for a better solution if it ever comes up as this
> > > > > > > isn't a real issue that people are seeing on actual systems, but just a
> > > > > > > syzbot report.
> > > > > >
> > > > > > What does the "better solution" look like?
> > > > >
> > > > > One that actually fixes the root problem here (i.e. does not break the
> > > > > recursion loop, or cause a performance decrease for normal users, or
> > > > > prevent this from being bound to the console).
> > > >
> > > > Does this solution break the recursion loop or affect performance?
> > >
> > > This solution broke the recursion by returning an error, right?
> >
> > This is the part I was least sure about.
> >
> > If this was considered valid and we were to go forward with a solution
> > like this, what would a quality improvement look like? Should we have
> > stayed in this function and waited for the previous occupant to leave
> > before continuing through ->write()?
>
> This isn't valid, as it obviously never shows up in real use.
>
> The real solution should be to prevent binding a console to this line
> discipline as it can not handle the recursion that consoles require for
> the write path.
Would something like this tick that box?
diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
index 1f3aba607cd51..5c1d2fcd5d9e2 100644
--- a/drivers/tty/n_gsm.c
+++ b/drivers/tty/n_gsm.c
@@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
if (!gsm)
return -ENODEV;
+ /* The GSM line discipline does not support binding to console */
+ if (strncmp(tty->name, "tty", 3))
+ return -EINVAL;
+
ret = -ENOBUFS;
spin_lock_irqsave(&gsm->tx_lock, flags);
space = tty_write_room(tty);
> Then, if consoles are really needed, the code can be fixed up to handle
> such recursion. That's not a trivial thing to do, as can be seen by the
> crazy gyrations that the n_tty line discipline does in its write path...
--
Lee Jones [李琼斯]
On Thu, Oct 05, 2023 at 06:59:50AM +0200, Jiri Slaby wrote:
> On 04. 10. 23, 14:57, Lee Jones wrote:
> > > > n_gsm is available on all the systems I have available. The mention of
> > > > 'EXPERIMENTAL' in the module description appears to have zero effect on
> > > > whether distros choose to make it available or not. If you're saying
> > > > that we know this module is BROKEN however, then perhaps we should mark
> > > > it as such.
> > >
> > > Also, I think this requires root to set this line discipline to the
> > > console, right? A normal user can't do that, or am I missing a code
> > > path here?
> >
> > I haven't been testing long, but yes, early indications show that root
> > is required.
>
> FWIW I concluded to the same yesterday, so I disputed the connected
> CVE-2023-31082. Waiting for mitre's ack/nack.
Trying to dispute obviously-wrong CVEs is a never-ending task.
Personally, it's fun to see who keeps popping up to attempt to resolve
this issue showing what companies have their security policies dictated
by a random government authority that can be modified by anyone :)
thanks,
greg k-h
On Thu, Oct 05, 2023 at 12:46:32PM +0100, Lee Jones wrote:
> On Thu, 05 Oct 2023, Starke, Daniel wrote:
>
> > > > Would something like this tick that box?
> > > >
> > > > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index
> > > > 1f3aba607cd51..5c1d2fcd5d9e2 100644
> > > > --- a/drivers/tty/n_gsm.c
> > > > +++ b/drivers/tty/n_gsm.c
> > > > @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> > > > if (!gsm)
> > > > return -ENODEV;
> > > >
> > > > + /* The GSM line discipline does not support binding to console */
> > > > + if (strncmp(tty->name, "tty", 3))
> > > > + return -EINVAL;
> > >
> > > No, that's not going to work, some consoles do not start with "tty" :(
>
> Ah, you mean there are others that we need to consider?
>
> I was just covering off con_write() from drivers/tty/vt/vt.c.
>
> Does anyone have a counter proposal?
consoles are dynamically assigned, the device knows if it is a console
or not, so there is a way to determine this at runtime. It's not a
device naming thing at all.
thanks,
greg k-h
On Thu, 05 Oct 2023, Greg Kroah-Hartman wrote:
> On Thu, Oct 05, 2023 at 06:59:50AM +0200, Jiri Slaby wrote:
> > On 04. 10. 23, 14:57, Lee Jones wrote:
> > > > > n_gsm is available on all the systems I have available. The mention of
> > > > > 'EXPERIMENTAL' in the module description appears to have zero effect on
> > > > > whether distros choose to make it available or not. If you're saying
> > > > > that we know this module is BROKEN however, then perhaps we should mark
> > > > > it as such.
> > > >
> > > > Also, I think this requires root to set this line discipline to the
> > > > console, right? A normal user can't do that, or am I missing a code
> > > > path here?
> > >
> > > I haven't been testing long, but yes, early indications show that root
> > > is required.
> >
> > FWIW I concluded to the same yesterday, so I disputed the connected
> > CVE-2023-31082. Waiting for mitre's ack/nack.
>
> Trying to dispute obviously-wrong CVEs is a never-ending task.
>
> Personally, it's fun to see who keeps popping up to attempt to resolve
> this issue showing what companies have their security policies dictated
> by a random government authority that can be modified by anyone :)
It's a struggle! :)
--
Lee Jones [李琼斯]
On Thu, Oct 05, 2023 at 10:03:11AM +0100, Lee Jones wrote:
> On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
>
> > On Wed, Oct 04, 2023 at 01:57:04PM +0100, Lee Jones wrote:
> > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > >
> > > > On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > > >
> > > > > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > > > > Daniel, any thoughts?
> > > > > > >
> > > > > > > Our application of this protocol is only with specific modems to enable
> > > > > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > > > > The protocol was developed for such use cases.
> > > > > > >
> > > > > > > Regarding the issue itself:
> > > > > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > > > > gsm_dlci_t1() on every mutex lock there.
> > > > >
> > > > > That's correct. When I initially saw this report, my initial thought
> > > > > was to replace the spinlocks with mutexts, but having read the previous
> > > > > accepted attempt and it's subsequent reversion I started to think of
> > > > > other ways to solve this issue. This solution, unlike the last, does
> > > > > not involve adding sleep inducing locks into atomic contexts, nor
> > > > > should it negatively affect performance.
> > > > >
> > > > > > > There was also a suggestion to fix this in do_con_write() as
> > > > > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > > > > >
> > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > >
> > > > > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > > > > review queue and wait for a better solution if it ever comes up as this
> > > > > > isn't a real issue that people are seeing on actual systems, but just a
> > > > > > syzbot report.
> > > > >
> > > > > What does the "better solution" look like?
> > > >
> > > > One that actually fixes the root problem here (i.e. does not break the
> > > > recursion loop, or cause a performance decrease for normal users, or
> > > > prevent this from being bound to the console).
> > >
> > > Does this solution break the recursion loop or affect performance?
> >
> > This solution broke the recursion by returning an error, right?
>
> This is the part I was least sure about.
>
> If this was considered valid and we were to go forward with a solution
> like this, what would a quality improvement look like? Should we have
> stayed in this function and waited for the previous occupant to leave
> before continuing through ->write()?
This isn't valid, as it obviously never shows up in real use.
The real solution should be to prevent binding a console to this line
discipline as it can not handle the recursion that consoles require for
the write path.
Then, if consoles are really needed, the code can be fixed up to handle
such recursion. That's not a trivial thing to do, as can be seen by the
crazy gyrations that the n_tty line discipline does in its write path...
thanks,
greg k-h
On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> On Wed, Oct 04, 2023 at 01:57:04PM +0100, Lee Jones wrote:
> > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> >
> > > On Wed, Oct 04, 2023 at 09:57:20AM +0100, Lee Jones wrote:
> > > > On Wed, 04 Oct 2023, Greg Kroah-Hartman wrote:
> > > >
> > > > > On Wed, Oct 04, 2023 at 05:59:09AM +0000, Starke, Daniel wrote:
> > > > > > > Daniel, any thoughts?
> > > > > >
> > > > > > Our application of this protocol is only with specific modems to enable
> > > > > > circuit switched operation (handling calls, selecting/querying networks,
> > > > > > etc.) while doing packet switched communication (i.e. IP traffic over PPP).
> > > > > > The protocol was developed for such use cases.
> > > > > >
> > > > > > Regarding the issue itself:
> > > > > > There was already an attempt to fix all this by switching from spinlocks to
> > > > > > mutexes resulting in ~20% performance loss. However, the patch was reverted
> > > > > > as it did not handle the T1 timer leading into sleep during atomic within
> > > > > > gsm_dlci_t1() on every mutex lock there.
> > > >
> > > > That's correct. When I initially saw this report, my initial thought
> > > > was to replace the spinlocks with mutexts, but having read the previous
> > > > accepted attempt and it's subsequent reversion I started to think of
> > > > other ways to solve this issue. This solution, unlike the last, does
> > > > not involve adding sleep inducing locks into atomic contexts, nor
> > > > should it negatively affect performance.
> > > >
> > > > > > There was also a suggestion to fix this in do_con_write() as
> > > > > > tty_operations::write() appears to be documented as "not allowed to sleep".
> > > > > > The patch for this was rejected. It did not fix the issue within n_gsm.
> > > > > >
> > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > > > Link: https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > Ok, I thought I remembered this, I'll just drop this patch from my
> > > > > review queue and wait for a better solution if it ever comes up as this
> > > > > isn't a real issue that people are seeing on actual systems, but just a
> > > > > syzbot report.
> > > >
> > > > What does the "better solution" look like?
> > >
> > > One that actually fixes the root problem here (i.e. does not break the
> > > recursion loop, or cause a performance decrease for normal users, or
> > > prevent this from being bound to the console).
> >
> > Does this solution break the recursion loop or affect performance?
>
> This solution broke the recursion by returning an error, right?
This is the part I was least sure about.
If this was considered valid and we were to go forward with a solution
like this, what would a quality improvement look like? Should we have
stayed in this function and waited for the previous occupant to leave
before continuing through ->write()?
> The performance one was by using mutexes as in previous attempts.
Right. This solution was designed to avoid that.
--
Lee Jones [李琼斯]
On Thu, 05 Oct 2023, Greg Kroah-Hartman wrote:
> On Thu, Oct 05, 2023 at 12:46:32PM +0100, Lee Jones wrote:
> > On Thu, 05 Oct 2023, Starke, Daniel wrote:
> >
> > > > > Would something like this tick that box?
> > > > >
> > > > > diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index
> > > > > 1f3aba607cd51..5c1d2fcd5d9e2 100644
> > > > > --- a/drivers/tty/n_gsm.c
> > > > > +++ b/drivers/tty/n_gsm.c
> > > > > @@ -3716,6 +3716,10 @@ static ssize_t gsmld_write(struct tty_struct *tty, struct file *file,
> > > > > if (!gsm)
> > > > > return -ENODEV;
> > > > >
> > > > > + /* The GSM line discipline does not support binding to console */
> > > > > + if (strncmp(tty->name, "tty", 3))
> > > > > + return -EINVAL;
> > > >
> > > > No, that's not going to work, some consoles do not start with "tty" :(
> >
> > Ah, you mean there are others that we need to consider?
> >
> > I was just covering off con_write() from drivers/tty/vt/vt.c.
> >
> > Does anyone have a counter proposal?
>
> consoles are dynamically assigned, the device knows if it is a console
> or not, so there is a way to determine this at runtime. It's not a
> device naming thing at all.
It's not a device naming thing, but it is a ... :)
Okay, here's another uninformed stab in the dark:
diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
index 1f3aba607cd51..ddf00f6a4141d 100644
--- a/drivers/tty/n_gsm.c
+++ b/drivers/tty/n_gsm.c
@@ -3629,6 +3629,10 @@ static int gsmld_open(struct tty_struct *tty)
if (tty->ops->write == NULL)
return -EINVAL;
+ /* The GSM line discipline does not support binding to console */
+ if (tty->driver->type == TTY_DRIVER_TYPE_CONSOLE)
+ return -EINVAL;
+
/* Attach our ldisc data */
gsm = gsm_alloc_mux();
if (gsm == NULL)
This seems to allow for the real serial devices (TTY_DRIVER_TYPE_SERIAL)
suggested by Daniel.
--
Lee Jones [李琼斯]