Hi all
I'm facing a problem with TTY loosing data using u_serial gadget.
I have a MPC8247 based system running 2.6.33. Data transfer is done
bidirectional
to see the problem more often. I use tty in raw mode and do some delayed
reads on the
receiver side to stress the throttling / un-throttling.
I tracked down the problem and was able to see that n_tty_receive_buf()
is not able to store all data in the read queue.
The function flush_to_ldisc() use the values of tty->receive_room to
schedule_delayed_work()or to pass data to receive_buf() and finally
n_tty_receive_buf().
Also the number of passed bytes rely on variable receive_room.
I added debug code to re-calculate the real free space.
left_space = N_TTY_BUF_SIZE - tty->read_cnt - 1;
Comparing receive_room and left_space showed in some cases big
differences up to 1600
bytes. left_space was always smaller and sometimes zero.
receive_room is calculated in n_tty_set_room() without read locking.
There are various "FIXME does n_tty_set_room need locking ?"
My test showed, adding a read looking solves the problem.
I asked me why is the locking not done? How big is the risk for dead-locks?
To minimize this risk, I reduced the changes to a minimum (see the patch
below).
The code in the patch only re-calculates the receive_room for raw mode
right before
it is used in flush_to_ldisc().
Can somebody give me an advice, what is the best way to solve this problem?
Thanks for your help.
Regards Stefan
From 91b04c875cecc890139ccdd101cfff6b02b5f22c Mon Sep 17 00:00:00 2001
From: Stefan Bigler <[email protected]>
Date: Wed, 16 Mar 2011 15:20:03 +0100
Subject: [PATCH 1/1] n_tty: fix race condition with receive_room
Do an additional update of receive_room with locking for raw tty.
The n_tty_set_room is done without locking what is known as a
potential problem.
In case of heavy load and raw tty and flush_to_ldisc() determine
the number of char to send with receive_buf(). If there is not
enough space available in receive_buf() the data is lost.
This additional update of receive_room should reduce the risk of
have invalid receive_room. It is not a clean fix but the risk of
risk a dead-lock with clean protection is too high
Signed-off-by: Stefan Bigler <[email protected]>
---
drivers/char/tty_buffer.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
index f27c4d6..5db07fe 100644
--- a/drivers/char/tty_buffer.c
+++ b/drivers/char/tty_buffer.c
@@ -431,6 +431,22 @@ static void flush_to_ldisc(struct work_struct *work)
line discipline as we want to empty the queue */
if (test_bit(TTY_FLUSHPENDING, &tty->flags))
break;
+
+ /* Take read_lock to update receive_room.
+ This avoids invalid free space information.
+ To limit the risk of introducing problems
+ only do it for raw tty */
+ if (tty->raw) {
+ unsigned long flags_r;
+ spin_lock_irqsave(&tty->read_lock,
+ flags_r);
+ tty->receive_room =
+ N_TTY_BUF_SIZE -
+ tty->read_cnt - 1;
+ spin_unlock_irqrestore(&tty->read_lock,
+ flags_r);
+ }
+
if (!tty->receive_room) {
schedule_delayed_work(&tty->buf.work, 1);
break;
--
1.7.0.5
On Wed, Mar 16, 2011 at 09:47:50PM +0100, Stefan Bigler wrote:
> Hi all
>
> I'm facing a problem with TTY loosing data using u_serial gadget.
>
> I have a MPC8247 based system running 2.6.33.
Oh, that's an old kernel, lots of work on the tty layer has happened
since then that would fix issues like this.
If you are stuck using this kernel, please go ask your vendor for
support for it.
If not, can you try out 2.6.38 and let us, and the
[email protected] list know if you still have problems with it?
thanks,
greg k-h
Hi Greg
Thanks for your quick answer. Meanwhile we ported our board to 2.6.38.
We recognize the same problem again.
I had also a look at the relevant fixes, a lot is done but I could not find
the required protection of the attribute receive_room.
I added a printk in case of failure were receive_room shows more
space that is really available in the queue and also used.
The printk is attached below and also the log.
Regards
Stefan
Log:
flush_to_ldisc, receive_room to small count=201, receive_room=287,
real_room=0
flush_to_ldisc, receive_room to small count=225, receive_room=1031,
real_room=0
flush_to_ldisc, receive_room to small count=33, receive_room=4095,
real_room=0
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index d8210ca..fa9ca62 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -442,6 +442,21 @@ static void flush_to_ldisc(struct work_struct *work)
line discipline as we want to empty the queue */
if (test_bit(TTY_FLUSHPENDING, &tty->flags))
break;
+
+ /* add printk in case of inconsistante size */
+ if (tty->raw) {
+ unsigned long flags_r;
+ int real_room;
+ spin_lock_irqsave(&tty->read_lock, flags_r);
+ real_room = N_TTY_BUF_SIZE -tty->read_cnt - 1;
+ if (real_room < min(count, tty->receive_room)) {
+ printk(KERN_ERR "flush_to_ldisc, receive_room to
small count=%d,"
+ " receive_room=%d, real_room=%d\n",
+ count, tty->receive_room, real_room);
+ }
+ spin_unlock_irqrestore(&tty->read_lock, flags_r);
+ }
+
if (!tty->receive_room || seen_tail) {
schedule_delayed_work(&tty->buf.work, 1);
break;
Am 03/17/2011 01:04 AM, wrote Greg KH:
> On Wed, Mar 16, 2011 at 09:47:50PM +0100, Stefan Bigler wrote:
>> Hi all
>>
>> I'm facing a problem with TTY loosing data using u_serial gadget.
>>
>> I have a MPC8247 based system running 2.6.33.
> Oh, that's an old kernel, lots of work on the tty layer has happened
> since then that would fix issues like this.
>
> If you are stuck using this kernel, please go ask your vendor for
> support for it.
>
> If not, can you try out 2.6.38 and let us, and the
> [email protected] list know if you still have problems with it?
>
> thanks,
>
> greg k-h
Hi,
(please don't top-post)
On Fri, Mar 18, 2011 at 05:35:56PM +0100, Stefan Bigler wrote:
> Thanks for your quick answer. Meanwhile we ported our board to 2.6.38.
> We recognize the same problem again.
> I had also a look at the relevant fixes, a lot is done but I could not find
> the required protection of the attribute receive_room.
>
> I added a printk in case of failure were receive_room shows more
> space that is really available in the queue and also used.
>
> The printk is attached below and also the log.
>
> Regards
> Stefan
>
>
> Log:
> flush_to_ldisc, receive_room to small count=201, receive_room=287,
> real_room=0
> flush_to_ldisc, receive_room to small count=225, receive_room=1031,
> real_room=0
> flush_to_ldisc, receive_room to small count=33, receive_room=4095,
> real_room=0
I had the same problem while I was working at Nokia developing the N900.
I even sent a patch (twice) with Alan Cox on the loop but nothing has
happened. The patch might still be floating on the archives but
essentially it would make ->receive_buf() return the amount of received
bytes (as IMHO it should be done).
--
balbi
> I had the same problem while I was working at Nokia developing the N900.
> I even sent a patch (twice) with Alan Cox on the loop but nothing has
> happened.
http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/02347.html
is I assume the thread you are talking about - which ground to a halt
without actually figuring out what you were seeing and what was really
going on.
http://permalink.gmane.org/gmane.linux.usb.general/28976
would be the second generation one which was somewhat longer after I
ceased to be tty maintainer so I can't really say it never got upstream.
I still think its the right thing to do, so perhaps it's worth reposting
it, especially if it fixes this case too.
Alan
> I had also a look at the relevant fixes, a lot is done but I could not find
> the required protection of the attribute receive_room.
receive_room isn't protected because it may only be shrunk by the amount
of data sent to the ldisc or less. The ldisc is at liberty to grow the
value as it sees fit.
In essence if you get a value from receive_room it's a guarantee you may
send that many bytes, it is not a precise instantaneous perfect answer to
the question "exactly what number of bytes could fit at this precise
moment".
Which does of course mean you should never see the case where
receive_room is bigger than the actual space available in tty raw mode.
> In essence if you get a value from receive_room it's a guarantee you may
> send that many bytes, it is not a precise instantaneous perfect answer to
> the question "exactly what number of bytes could fit at this precise
> moment".
>
Hi
I have no problem with having more space available than guarantied by
the value of receive_room. But what I see is that the value can be less!
Data will be lost.
My test showed:
data to queue=201
receive_room (guarantied space) =287
real space in queue =0
In the last mail I attached the code for printing, the read_lock was only
added to show a constistant view of all current values.
Stefan
Hi
Returning the size of queued data is a reasonable solution for this problem.
Stefan
On Fri, Mar 18, 2011 at 06:06:57PM +0000, Alan Cox wrote:
> > I had the same problem while I was working at Nokia developing the N900.
> > I even sent a patch (twice) with Alan Cox on the loop but nothing has
> > happened.
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/02347.html
>
> is I assume the thread you are talking about - which ground to a halt
> without actually figuring out what you were seeing and what was really
> going on.
>
> http://permalink.gmane.org/gmane.linux.usb.general/28976
>
> would be the second generation one which was somewhat longer after I
> ceased to be tty maintainer so I can't really say it never got upstream.
>
> I still think its the right thing to do, so perhaps it's worth reposting
> it, especially if it fixes this case too.
Ok, I'll update that patch and re-send. I'll compile test on x86 and
boot test on an ARM box, so it would be cool to have Acks from several
people before applying, just in case.
--
balbi
On Mon, Mar 21, 2011 at 11:32:57AM +0200, Felipe Balbi wrote:
> On Fri, Mar 18, 2011 at 06:06:57PM +0000, Alan Cox wrote:
> > > I had the same problem while I was working at Nokia developing the N900.
> > > I even sent a patch (twice) with Alan Cox on the loop but nothing has
> > > happened.
> >
> > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/02347.html
> >
> > is I assume the thread you are talking about - which ground to a halt
> > without actually figuring out what you were seeing and what was really
> > going on.
> >
> > http://permalink.gmane.org/gmane.linux.usb.general/28976
> >
> > would be the second generation one which was somewhat longer after I
> > ceased to be tty maintainer so I can't really say it never got upstream.
> >
> > I still think its the right thing to do, so perhaps it's worth reposting
> > it, especially if it fixes this case too.
>
> Ok, I'll update that patch and re-send. I'll compile test on x86 and
> boot test on an ARM box, so it would be cool to have Acks from several
> people before applying, just in case.
Patch attached, please give it a good round of test as I don't have how
to exercise all line disciplines. I ran 100 randconfigs over night and
no warnings or erros on that area, at least not that I could see (so
many warning while compiling the kernel :-( )
--
balbi
> Patch attached, please give it a good round of test as I don't have how
> to exercise all line disciplines. I ran 100 randconfigs over night and
> no warnings or erros on that area, at least not that I could see (so
> many warning while compiling the kernel :-( )
n_tty is I believe the only ldisc which assets throttling - the others
actually prefer to lose data on an overflow not get further blockages.
Greg - can we get this into -next once -rc1 is out and see who screams ?
On Tue, Mar 22, 2011 at 11:04:12AM +0000, Alan Cox wrote:
> > Patch attached, please give it a good round of test as I don't have how
> > to exercise all line disciplines. I ran 100 randconfigs over night and
> > no warnings or erros on that area, at least not that I could see (so
> > many warning while compiling the kernel :-( )
>
> n_tty is I believe the only ldisc which assets throttling - the others
> actually prefer to lose data on an overflow not get further blockages.
>
>
> Greg - can we get this into -next once -rc1 is out and see who screams ?
Yes, I will do that.
thanks,
greg k-h
On 22/03/2011 08:53, Felipe Balbi wrote:
> Patch attached, please give it a good round of test as I don't have how
> to exercise all line disciplines. I ran 100 randconfigs over night and
> no warnings or erros on that area, at least not that I could see (so
> many warning while compiling the kernel :-( )
Is this patch missing the changes to tty_buffer.c that were in
http://permalink.gmane.org/gmane.linux.usb.general/28976
Without the changes to tty_buffer.c to use the value returned from the
receive_buf call then doesn't this patch not work correctly?
Also with this patch, does the receive_room member of tty_struct have
any use? As far as I can tell it's also referenced in paste_selection in
drivers/tty/vt/selection.c. It's the case that past_selection uses
receive_buf so shouldn't it be updated to use the new return value
semantics for receive_buf?
Without modifying tty_buffer.c to not make use of receive_room I can't
get console terminals to work with this patch. Although I have to admit
that I've been applying the patch to 2.6.35.3 as that's the kernel my
development board is currently using, but I can't see any immediate
reason why the most recent kernel would be any different. However I'm
still fairly new to the interactions between the various bits of tty
code and drivers, so I could just be missing an important change that's
in 2.6.38+.
Regards,
Toby
Hi,
On Thu, Mar 24, 2011 at 12:07:56PM +0000, Toby Gray wrote:
> On 22/03/2011 08:53, Felipe Balbi wrote:
> >Patch attached, please give it a good round of test as I don't have how
> >to exercise all line disciplines. I ran 100 randconfigs over night and
> >no warnings or erros on that area, at least not that I could see (so
> >many warning while compiling the kernel :-( )
>
> Is this patch missing the changes to tty_buffer.c that were in
> http://permalink.gmane.org/gmane.linux.usb.general/28976
>
> Without the changes to tty_buffer.c to use the value returned from
> the receive_buf call then doesn't this patch not work correctly?
>
> Also with this patch, does the receive_room member of tty_struct have
> any use? As far as I can tell it's also referenced in paste_selection
> in drivers/tty/vt/selection.c. It's the case that past_selection uses
> receive_buf so shouldn't it be updated to use the new return value
> semantics for receive_buf?
>
> Without modifying tty_buffer.c to not make use of receive_room I
> can't get console terminals to work with this patch. Although I have
> to admit that I've been applying the patch to 2.6.35.3 as that's the
> kernel my development board is currently using, but I can't see any
> immediate reason why the most recent kernel would be any different.
> However I'm still fairly new to the interactions between the various
> bits of tty code and drivers, so I could just be missing an important
> change that's in 2.6.38+.
you are right, thanks for noticing that. Attached is an updated patch.
I left removal of receive_room out of this patch to prevent too invasive
change, that can be done on a separate patch.
--
balbi
On 24/03/2011 12:37, Felipe Balbi wrote:
> Hi,
>
> On Thu, Mar 24, 2011 at 12:07:56PM +0000, Toby Gray wrote:
>> Is this patch missing the changes to tty_buffer.c that were in
>> http://permalink.gmane.org/gmane.linux.usb.general/28976
>>
>> Without the changes to tty_buffer.c to use the value returned from
>> the receive_buf call then doesn't this patch not work correctly?
>>
<snip>
> you are right, thanks for noticing that. Attached is an updated patch.
> I left removal of receive_room out of this patch to prevent too invasive
> change, that can be done on a separate patch.
I've just tried removing receive_room myself and noticed that it is
still used in flush_to_ldisc to decide if it needs to schedule work to
be done later:
if (!tty->receive_room || seen_tail) {
schedule_work(&tty->buf.work);
break;
}
If receive_room is no longer being updated then isn't this the wrong
thing to do? Shouldn't it check if some data was copied after calling
receive_buf, and if there wasn't any then it should schedule the work
and break?
The other query I've got is about the return value from receive_buf. I
noticed that you've modified some drivers (such as
bluetooth/hci_ldisc.c) so that they can return error values, such as
-ENODEV. Won't this cause things to go wrong when flush_to_ldisc and
paste_selection use the return value without checking for it being negative?
Thank you for your quick reply to my first query, it's appreciated.
Regards,
Toby
On Thu, Mar 24, 2011 at 12:51:45PM +0000, Toby Gray wrote:
> On 24/03/2011 12:37, Felipe Balbi wrote:
> >Hi,
> >
> >On Thu, Mar 24, 2011 at 12:07:56PM +0000, Toby Gray wrote:
> >>Is this patch missing the changes to tty_buffer.c that were in
> >>http://permalink.gmane.org/gmane.linux.usb.general/28976
> >>
> >>Without the changes to tty_buffer.c to use the value returned from
> >>the receive_buf call then doesn't this patch not work correctly?
> >>
> <snip>
> >you are right, thanks for noticing that. Attached is an updated patch.
> >I left removal of receive_room out of this patch to prevent too invasive
> >change, that can be done on a separate patch.
>
> I've just tried removing receive_room myself and noticed that it is
> still used in flush_to_ldisc to decide if it needs to schedule work
> to be done later:
>
> if (!tty->receive_room || seen_tail) {
> schedule_work(&tty->buf.work);
> break;
> }
>
> If receive_room is no longer being updated then isn't this the wrong
> thing to do? Shouldn't it check if some data was copied after calling
> receive_buf, and if there wasn't any then it should schedule the work
> and break?
>
> The other query I've got is about the return value from receive_buf.
> I noticed that you've modified some drivers (such as
> bluetooth/hci_ldisc.c) so that they can return error values, such as
> -ENODEV. Won't this cause things to go wrong when flush_to_ldisc and
> paste_selection use the return value without checking for it being
> negative?
>
> Thank you for your quick reply to my first query, it's appreciated.
Can you try this:
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index 2f119e2..f0b9fb6 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -443,17 +443,19 @@ static void flush_to_ldisc(struct work_struct *work)
line discipline as we want to empty the queue */
if (test_bit(TTY_FLUSHPENDING, &tty->flags))
break;
- if (!tty->receive_room || seen_tail) {
- schedule_delayed_work(&tty->buf.work, 1);
- break;
- }
char_buf = head->char_buf_ptr + head->read;
flag_buf = head->flag_buf_ptr + head->read;
spin_unlock_irqrestore(&tty->buf.lock, flags);
copied = disc->ops->receive_buf(tty, char_buf,
flag_buf, count);
spin_lock_irqsave(&tty->buf.lock, flags);
+
head->read += copied;
+
+ if (copied == 0 || seen_tail) {
+ schedule_delayed_work(&tty->buf.work, 1);
+ break;
+ }
}
clear_bit(TTY_FLUSHING, &tty->flags);
}
if I read the code correctly, when we have no space to receive bytes,
then we schedule work, that should be the same.
--
balbi
Can you try this:
> diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
> index 2f119e2..f0b9fb6 100644
> --- a/drivers/tty/tty_buffer.c
> +++ b/drivers/tty/tty_buffer.c
I tried it and the console is working again.
There is still an problem on heavy load transfer from host to gadget.
I attached the patch to fix also this problem.
diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index 8e0f113..95d0a9c 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1359,6 +1359,7 @@ static unsigned int n_tty_receive_buf(struct
tty_struct *tty,
memcpy(tty->read_buf + tty->read_head, cp, i);
tty->read_head = (tty->read_head + i) & (N_TTY_BUF_SIZE-1);
tty->read_cnt += i;
+ ret += i;
spin_unlock_irqrestore(&tty->read_lock, cpuflags);
} else {
ret = count;
With this patch I was able to transfer gigabytes without lost data.
On 24/03/2011 15:40, Stefan Bigler wrote:
> Can you try this:
>> diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
>> index 2f119e2..f0b9fb6 100644
>> --- a/drivers/tty/tty_buffer.c
>> +++ b/drivers/tty/tty_buffer.c
>
> I tried it and the console is working again.
> There is still an problem on heavy load transfer from host to gadget.
> I attached the patch to fix also this problem.
>
> diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> index 8e0f113..95d0a9c 100644
> --- a/drivers/tty/n_tty.c
> +++ b/drivers/tty/n_tty.c
> @@ -1359,6 +1359,7 @@ static unsigned int n_tty_receive_buf(struct
> tty_struct *tty,
> memcpy(tty->read_buf + tty->read_head, cp, i);
> tty->read_head = (tty->read_head + i) &
> (N_TTY_BUF_SIZE-1);
> tty->read_cnt += i;
> + ret += i;
> spin_unlock_irqrestore(&tty->read_lock, cpuflags);
> } else {
> ret = count;
>
> With this patch I was able to transfer gigabytes without lost data.
I can confirm that this combined with the last two patches from Felipe
Balbi fix all the issues I've had as well and seems stable and without
loss under fast data transfers.
Regards,
Toby
On Thu, Mar 24, 2011 at 04:15:02PM +0000, Toby Gray wrote:
> On 24/03/2011 15:40, Stefan Bigler wrote:
> >Can you try this:
> >>diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
> >>index 2f119e2..f0b9fb6 100644
> >>--- a/drivers/tty/tty_buffer.c
> >>+++ b/drivers/tty/tty_buffer.c
> >
> >I tried it and the console is working again.
> >There is still an problem on heavy load transfer from host to gadget.
> >I attached the patch to fix also this problem.
> >
> >diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> >index 8e0f113..95d0a9c 100644
> >--- a/drivers/tty/n_tty.c
> >+++ b/drivers/tty/n_tty.c
> >@@ -1359,6 +1359,7 @@ static unsigned int n_tty_receive_buf(struct
> >tty_struct *tty,
> > memcpy(tty->read_buf + tty->read_head, cp, i);
> > tty->read_head = (tty->read_head + i) &
> >(N_TTY_BUF_SIZE-1);
> > tty->read_cnt += i;
> >+ ret += i;
> > spin_unlock_irqrestore(&tty->read_lock, cpuflags);
> > } else {
> > ret = count;
> >
> >With this patch I was able to transfer gigabytes without lost data.
>
> I can confirm that this combined with the last two patches from
> Felipe Balbi fix all the issues I've had as well and seems stable and
> without loss under fast data transfers.
Great guys, thanks a lot. Here's the final patch with your Tested-bys
--
balbi