2014-02-18 09:36:14

by Stanislaw Gruszka

[permalink] [raw]
Subject: locking changes in tty broke low latency feature

Hi,

setserial has low_latency option which should minimize receive latency
(scheduler delay). AFAICT it is used if someone talk to external device
via RS-485/RS-232 and need to have quick requests and responses . On
kernel this feature was implemented by direct tty processing from
interrupt context:

void tty_flip_buffer_push(struct tty_port *port)
{
struct tty_bufhead *buf = &port->buf;

buf->tail->commit = buf->tail->used;

if (port->low_latency)
flush_to_ldisc(&buf->work);
else
schedule_work(&buf->work);
}

But after 3.12 tty locking changes, calling flush_to_ldisc() from
interrupt context is a bug (we got scheduling while atomic bug report
here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )

I'm not sure how this should be solved. After Peter get rid all of those
race condition in tty layer, we probably don't want go back to use
spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
flush_to_ldisc() work there. Or perhaps users that need to low latency,
should switch to thread irq and prioritize serial irq to meat
retirements. Anyway setserial low_latency is now broken and all who use
this feature in the past can not do this any longer on 3.12+ kernels.

Thoughts ?

Stanislaw


2014-02-18 09:57:54

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
> flush_to_ldisc() work there. Or perhaps users that need to low latency,
> should switch to thread irq and prioritize serial irq to meat
> retirements. Anyway setserial low_latency is now broken and all who use
> this feature in the past can not do this any longer on 3.12+ kernels.

Historically speaking it was never allowed to use low_latency from a port
that did tty_flip_buffer_push from an IRQ as opposed to scheduling work.
The code also rather pre-dates threaded IRQ but that may well be a better
approach.

IMHO the right fix is to fastpath most of the tty layer (non N_TTY ldisc,
N_TTY without ICANON or ECHO*). Most of the remaining tty locking would
then go away almost entirely for these cases and we'd massively improve
things like our dismal 3G modem performance.

Likewise the termios lock can go by using RCU and passing the termios
struct into the driver as a copy of the RCU managed object (so we can
deal with sleeping drivers). Termios structs are tiny so the copying
overhead is basically nil.

It just needs someone sufficiently crazy and with a fair bit of time to
actually do the heavy lifting. I've been poking at bits of it but the
changes when switching ldisc are not entirely trivial and the N_TTY
fastpaths are quite a lot of work. Thankfully the non N_TTY ones are
simple.

Alan

2014-02-18 22:12:19

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

Hi Stanislaw,

On 02/18/2014 04:38 AM, Stanislaw Gruszka wrote:
> Hi,
>
> setserial has low_latency option which should minimize receive latency
> (scheduler delay). AFAICT it is used if someone talk to external device
> via RS-485/RS-232 and need to have quick requests and responses . On
> kernel this feature was implemented by direct tty processing from
> interrupt context:
>
> void tty_flip_buffer_push(struct tty_port *port)
> {
> struct tty_bufhead *buf = &port->buf;
>
> buf->tail->commit = buf->tail->used;
>
> if (port->low_latency)
> flush_to_ldisc(&buf->work);
> else
> schedule_work(&buf->work);
> }
>
> But after 3.12 tty locking changes, calling flush_to_ldisc() from
> interrupt context is a bug (we got scheduling while atomic bug report
> here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )
>
> I'm not sure how this should be solved. After Peter get rid all of those
> race condition in tty layer, we probably don't want go back to use
> spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
> flush_to_ldisc() work there. Or perhaps users that need to low latency,
> should switch to thread irq and prioritize serial irq to meat
> retirements. Anyway setserial low_latency is now broken and all who use
> this feature in the past can not do this any longer on 3.12+ kernels.
>
> Thoughts ?

Can you give me an idea of your device's average and minimum required
latency (please be specific)? Is your target arch x86 [so I can evaluate the
the impact of bus-locked instructions relative to your expected]?

Also, how painful would it be if unsupported termios changes were rejected
if the port was in low_latency mode and/or if low_latency setting was
disallowed because of termios state?

It would be pointless to throttle low_latency, yes?

What would be an acceptable outcome of being unable to accept input?
Corrupted overrun? Dropped i/o? Queued for later? Please explain with
comparison to the outcome of missed minimum latency.

Regards,
Peter Hurley

2014-02-19 13:00:58

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

Hello,

On Tue, Feb 18, 2014 at 05:12:13PM -0500, Peter Hurley wrote:
> On 02/18/2014 04:38 AM, Stanislaw Gruszka wrote:
> >Hi,
> >
> >setserial has low_latency option which should minimize receive latency
> >(scheduler delay). AFAICT it is used if someone talk to external device
> >via RS-485/RS-232 and need to have quick requests and responses . On
> >kernel this feature was implemented by direct tty processing from
> >interrupt context:
> >
> >void tty_flip_buffer_push(struct tty_port *port)
> >{
> > struct tty_bufhead *buf = &port->buf;
> >
> > buf->tail->commit = buf->tail->used;
> >
> > if (port->low_latency)
> > flush_to_ldisc(&buf->work);
> > else
> > schedule_work(&buf->work);
> >}
> >
> >But after 3.12 tty locking changes, calling flush_to_ldisc() from
> >interrupt context is a bug (we got scheduling while atomic bug report
> >here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )
> >
> >I'm not sure how this should be solved. After Peter get rid all of those
> >race condition in tty layer, we probably don't want go back to use
> >spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
> >flush_to_ldisc() work there. Or perhaps users that need to low latency,
> >should switch to thread irq and prioritize serial irq to meat
> >retirements. Anyway setserial low_latency is now broken and all who use
> >this feature in the past can not do this any longer on 3.12+ kernels.
> >
> >Thoughts ?
>
> Can you give me an idea of your device's average and minimum required
> latency (please be specific)? Is your target arch x86 [so I can evaluate the
> the impact of bus-locked instructions relative to your expected]?
>
> Also, how painful would it be if unsupported termios changes were rejected
> if the port was in low_latency mode and/or if low_latency setting was
> disallowed because of termios state?
>
> It would be pointless to throttle low_latency, yes?
>
> What would be an acceptable outcome of being unable to accept input?
> Corrupted overrun? Dropped i/o? Queued for later? Please explain with
> comparison to the outcome of missed minimum latency.

I'm sorry, I can not answer your questions.

For what I googled it looks like users wanted to get rid of 10ms jitter
caused by scheduler. Unfortunately that is all I know. I'm not user of
this feature and I don't know what are exact expectations. I'm adding Hal,
who reported bug to RH bugzilla, perhaps he can tell more about what
expectations are.

For now, since issue is not easy fixable, perhaps we should just fix
scheduling while atomic bug and add warning like on below patch ?

Thanks
Stanislaw

diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index 765125d..0fe6d65 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -504,9 +504,13 @@ void tty_flip_buffer_push(struct tty_port *port)

buf->tail->commit = buf->tail->used;

- if (port->low_latency)
- flush_to_ldisc(&buf->work);
- else
+ if (port->low_latency) {
+ if (WARN_ONCE(in_irq() || irqs_disabled(),
+ "serial low_latency feature can not be used in interrupt context"))
+ schedule_work(&buf->work);
+ else
+ flush_to_ldisc(&buf->work);
+ } else
schedule_work(&buf->work);
}
EXPORT_SYMBOL(tty_flip_buffer_push);

2014-02-19 17:39:06

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

Hi Grant,

On 02/19/2014 11:55 AM, Grant Edwards wrote:
> On 2014-02-19, Stanislaw Gruszka <[email protected]> wrote:
>> Hello,
>>
>> On Tue, Feb 18, 2014 at 05:12:13PM -0500, Peter Hurley wrote:
>>> On 02/18/2014 04:38 AM, Stanislaw Gruszka wrote:
>>>
>>>> setserial has low_latency option which should minimize receive latency
>>>> (scheduler delay). AFAICT it is used if someone talk to external device
>>>> via RS-485/RS-232 and need to have quick requests and responses.
>
> Exactly.

But not exactly, because I need a quantified value for "quick",
preferably with average latency measurements for 3.11- and 3.12+

>>>>
>>>> But after 3.12 tty locking changes, calling flush_to_ldisc() from
>>>> interrupt context is a bug (we got scheduling while atomic bug report
>>>> here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )
>>>
>>> Can you give me an idea of your device's average and minimum required
>>> latency (please be specific)?
>
> If by "device" you mean the UART being supported by the the driver in
> question, then there is no answer.
>
> Latency issues and requirements are a user-space application level
> requirement that depend on the specific application's requirements and
> those of the widget on the other end of the serial cable.

I'm trying to determine if 3.12+ already satisfies the userspace requirement
(or if the requirement is infeasible).

The assumption is that 3.12+ w/o low_latency is worse than 3.11- w/ low_latency,
which may not be true.

Also, as you note, the latency requirement is in userspace, so bound
to the behavior of the scheduler anyway. Thus, immediately writing
to the read buffer from IRQ may have no different average latency than
handling by a worker (as measured by the elapsed time from interrupt to
userspace read).

>>> Also, how painful would it be if unsupported termios changes were rejected
>>> if the port was in low_latency mode and/or if low_latency setting was
>>> disallowed because of termios state?
>>>
>>> It would be pointless to throttle low_latency, yes?
>
> By "throttle" I assume you're talking about flow control?

Driver throttle (but the discussion can include auto-flow control devices).

How can the requirement be for both must-handle-in-minimum-time data
(low_latency) and the-userspace-reader-isn't-reading-fast-enough-
so-its-ok-to-halt-transmission ?

Throttling/unthrottling the sender seems counter to "low latency".

> _Usually_ applications that require low latency are exchanging short
> messages (up to a few hundred bytes, but usually more like a few
> dozen). In those cases flow control is not generally needed.
>
> Does it matter?

Driver throttling requires excluding concurrent unthrottle and calling into
the driver (and said driver has relied on sleeping locks for many
kernel versions).

But first I'd like some hard data on whether or not a low latency
mode is even necessary (at least for user-space).

>>> What would be an acceptable outcome of being unable to accept input?
>>> Corrupted overrun? Dropped i/o? Queued for later? Please explain with
>>> comparison to the outcome of missed minimum latency.
>>
>> I'm sorry, I can not answer your questions.
>>
>> For what I googled it looks like users wanted to get rid of 10ms jitter
>> caused by scheduler.
>
> Yes. That was the use for the low_latency option. Historically, all
> of my drivers had supported the low_latency options for customers that
> found scheduling delays to be a problem.
>
> However, low_latency has been broken in certain contexts for a long
> time. As a result, drivers have to avoid using it either completely or
> sometimes only where rx data is handled in certain contexts.
>
> For some drivers the end result of that is that you can choose either
> a low-latency I/O mechanism or low-latency TTY layer rx handling, but
> you can't use both at the same time because the low-latency I/O
> mechanism handles rx data in a context where low-latency TTY layer
> stuff can't be used.
>
> Now that HZ is often 1000 and tickless is commonly used, I don't think
> the scheduling delay is nearly as much an issue as it used to be. I
> haven't gotten any complaints since it was largely rendered useless
> several years ago.
>
> Now all my drivers will silently override users if they set the
> low_latency flag on a port in situations where it can't be used.

Right. I'd rather not 'fix' something that doesn't really need
fixing (other than to suppress any WARNING caused by low_latency).

Regards,
Peter Hurley

2014-02-19 18:13:09

by Grant Edwards

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 2014-02-19, Peter Hurley <[email protected]> wrote:
> On 02/19/2014 11:55 AM, Grant Edwards wrote:

>>>>> setserial has low_latency option which should minimize receive latency
>>>>> (scheduler delay). AFAICT it is used if someone talk to external device
>>>>> via RS-485/RS-232 and need to have quick requests and responses.
>>
>> Exactly.
>
> But not exactly, because I need a quantified value for "quick",

Only the end-user knows that, and they don't tell us what it is unless
it's not being met. :)

> preferably with average latency measurements for 3.11- and 3.12+

It's been a long, long, time since I've done those measurements, and
whatever data I had is no longer relevent (even if I could find it) --
so I'll leave that up to somebody who's actually lobbying for getting
low_latency fixed so that it works from an interrupt context.

Just to be clearn, I'm not lobbying either for or against that. I'm
just trying to provide some perspective on the low_latency flag,
why it _was_ there, and who used it.

> I'm trying to determine if 3.12+ already satisfies the userspace
> requirement (or if the requirement is infeasible).
>
> The assumption is that 3.12+ w/o low_latency is worse than 3.11- w/
> low_latency, which may not be true.

I haven't heard any complaints for probably 10+ years. The last time
I did hear a complaint, I told them to set the low_latency flag and
that solved the problem.

> Also, as you note, the latency requirement is in userspace, so bound
> to the behavior of the scheduler anyway. Thus, immediately writing to
> the read buffer from IRQ may have no different average latency than
> handling by a worker (as measured by the elapsed time from interrupt
> to userspace read).

Yes. And the low_latency flag used to (as in 10 years ago) have a big
effect on that. Without the low latency flag, the user read would
happen up to 10ms (assuming HZ=100) after the data was received by the
driver. Setting the low-latency flag eliminated that 10ms jitter. I
don't know if these days setting the low_latency flag (in contexts
where it does work) even has a noticeable effect.

> How can the requirement be for both must-handle-in-minimum-time data
> (low_latency) and the-userspace-reader-isn't-reading-fast-enough-
> so-its-ok-to-halt-transmission ?
>
> Throttling/unthrottling the sender seems counter to "low latency".

Agreed.

I can _imagine_ a case where an application has a strict limit how
long it will wait until it sees the first byte of the response, but
for some isolated cases (uploading large data logs) that response
might be very large -- large neough that the application may have to
pause while reading the response and rely on flow control to throttle
the upload. But, I can't point to any specific instance of that.


>> _Usually_ applications that require low latency are exchanging short
>> messages (up to a few hundred bytes, but usually more like a few
>> dozen). In those cases flow control is not generally needed.
>>
>> Does it matter?
>
> Driver throttling requires excluding concurrent unthrottle and
> calling into the driver (and said driver has relied on sleeping locks
> for many kernel versions).

Is that still an issue for drivers where the flow cotrol is handled by
the UART?

> But first I'd like some hard data on whether or not a low latency
> mode is even necessary (at least for user-space).

I don't have any hard data, but gut anser is that it is probably no
longer needed. The problem that existed for the past few years was
that there was a user-space way to set the low_latency flag (it didn't
require root and you could even do it from the command-line with
setserial) -- and doing so annoyed the kernel.

>> Now that HZ is often 1000 and tickless is commonly used, I don't think
>> the scheduling delay is nearly as much an issue as it used to be. I
>> haven't gotten any complaints since it was largely rendered useless
>> several years ago.
>>
>> Now all my drivers will silently override users if they set the
>> low_latency flag on a port in situations where it can't be used.
>
> Right. I'd rather not 'fix' something that doesn't really need fixing
> (other than to suppress any WARNING caused by low_latency).

Yes. I think what currently needs to be done is to prevent any issues
caused by the user setting the low_latency. [IMO, warnings from the
kernel are an issue, even if the serial port continues to operate
properly.]

If somebody has specific latency requirements that aren't being met,
the current performance need to be measured, and it needs to be
decided if a solution is fesable and if the right solution is "fixing"
the low_latency flag so it's allowed in more contexts.

--
Grant Edwards grant.b.edwards Yow! I can't decide which
at WRONG TURN to make first!!
gmail.com I wonder if BOB GUCCIONE
has these problems!

2014-02-19 18:43:04

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/19/2014 01:12 PM, Grant Edwards wrote:
> On 2014-02-19, Peter Hurley <[email protected]> wrote:
>> On 02/19/2014 11:55 AM, Grant Edwards wrote:
>
>>>>>> setserial has low_latency option which should minimize receive latency
>>>>>> (scheduler delay). AFAICT it is used if someone talk to external device
>>>>>> via RS-485/RS-232 and need to have quick requests and responses.
>>>
>>> Exactly.
>>
>> But not exactly, because I need a quantified value for "quick",
>
> Only the end-user knows that, and they don't tell us what it is unless
> it's not being met. :)
>
>> preferably with average latency measurements for 3.11- and 3.12+
>
> It's been a long, long, time since I've done those measurements, and
> whatever data I had is no longer relevent (even if I could find it) --
> so I'll leave that up to somebody who's actually lobbying for getting
> low_latency fixed so that it works from an interrupt context.
>
> Just to be clearn, I'm not lobbying either for or against that. I'm
> just trying to provide some perspective on the low_latency flag,
> why it _was_ there, and who used it.

Yeah, I know; I didn't mean for you to supply those or champion the cause.

But since I'm being quoted on StackExchange, I thought I'd make my
expectations publicly known, so that if someone really feels they
need this feature, they'll speak up and back it up with numbers.

Actually right now, I'm looking into instrumenting the input path
with some event tracers so that I can actually measure that latency,
not in absolute terms, but at least relative to previous.

>> I'm trying to determine if 3.12+ already satisfies the userspace
>> requirement (or if the requirement is infeasible).
>>
>> The assumption is that 3.12+ w/o low_latency is worse than 3.11- w/
>> low_latency, which may not be true.
>
> I haven't heard any complaints for probably 10+ years. The last time
> I did hear a complaint, I told them to set the low_latency flag and
> that solved the problem.
>
>> Also, as you note, the latency requirement is in userspace, so bound
>> to the behavior of the scheduler anyway. Thus, immediately writing to
>> the read buffer from IRQ may have no different average latency than
>> handling by a worker (as measured by the elapsed time from interrupt
>> to userspace read).
>
> Yes. And the low_latency flag used to (as in 10 years ago) have a big
> effect on that. Without the low latency flag, the user read would
> happen up to 10ms (assuming HZ=100) after the data was received by the
> driver. Setting the low-latency flag eliminated that 10ms jitter. I
> don't know if these days setting the low_latency flag (in contexts
> where it does work) even has a noticeable effect.
>
>> How can the requirement be for both must-handle-in-minimum-time data
>> (low_latency) and the-userspace-reader-isn't-reading-fast-enough-
>> so-its-ok-to-halt-transmission ?
>>
>> Throttling/unthrottling the sender seems counter to "low latency".
>
> Agreed.
>
> I can _imagine_ a case where an application has a strict limit how
> long it will wait until it sees the first byte of the response, but
> for some isolated cases (uploading large data logs) that response
> might be very large -- large neough that the application may have to
> pause while reading the response and rely on flow control to throttle
> the upload. But, I can't point to any specific instance of that.
>
>
>>> _Usually_ applications that require low latency are exchanging short
>>> messages (up to a few hundred bytes, but usually more like a few
>>> dozen). In those cases flow control is not generally needed.
>>>
>>> Does it matter?
>>
>> Driver throttling requires excluding concurrent unthrottle and
>> calling into the driver (and said driver has relied on sleeping locks
>> for many kernel versions).
>
> Is that still an issue for drivers where the flow cotrol is handled by
> the UART?

Every driver pays the cost to evaluate if the throttling is required,
regardless of how it specifically handles flow control. pty is
now special cased in N_TTY because it forces throttling always on.

>> But first I'd like some hard data on whether or not a low latency
>> mode is even necessary (at least for user-space).
>
> I don't have any hard data, but gut anser is that it is probably no
> longer needed.

That's my belief as well.

> The problem that existed for the past few years was
> that there was a user-space way to set the low_latency flag (it didn't
> require root and you could even do it from the command-line with
> setserial) -- and doing so annoyed the kernel.
>
>>> Now that HZ is often 1000 and tickless is commonly used, I don't think
>>> the scheduling delay is nearly as much an issue as it used to be. I
>>> haven't gotten any complaints since it was largely rendered useless
>>> several years ago.
>>>
>>> Now all my drivers will silently override users if they set the
>>> low_latency flag on a port in situations where it can't be used.
>>
>> Right. I'd rather not 'fix' something that doesn't really need fixing
>> (other than to suppress any WARNING caused by low_latency).
>
> Yes. I think what currently needs to be done is to prevent any issues
> caused by the user setting the low_latency. [IMO, warnings from the
> kernel are an issue, even if the serial port continues to operate
> properly.]

Yeah, I'll go fix this.

> If somebody has specific latency requirements that aren't being met,
> the current performance need to be measured, and it needs to be
> decided if a solution is fesable and if the right solution is "fixing"
> the low_latency flag so it's allowed in more contexts.

Exactly.

Regards,
Peter Hurley

2014-02-19 19:17:37

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> How can the requirement be for both must-handle-in-minimum-time data
> (low_latency) and the-userspace-reader-isn't-reading-fast-enough-
> so-its-ok-to-halt-transmission ?

Because low latency is about *turn around* time. There are plenty of
protocols that can flow control, do flow control and want low latency
because they are not windowed. It's not mutually exclusive by any means.

> But first I'd like some hard data on whether or not a low latency
> mode is even necessary (at least for user-space).

The easy way to simulate the annoying as crap worst cases from dumbass
firmware downloaders and the like is to set up a link between two PCs and
time 2000+ repetitions of

send 64 bytes
wait for a Y
send 64 bytes
wait for a Y
....

and the matching far end being a box running an existing kernel or a PIC
or something doing the responses.

Historically we used to lose about 20mS per cycle which over 2000 got to
be a bit of a PITA.

Low latency goes back to the days of flip buffers, bottom halves an a
100Hz clock. There are certainly better ways to do it now if its needed.

Alan

2014-02-19 20:22:20

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/19/2014 02:17 PM, One Thousand Gnomes wrote:
>> How can the requirement be for both must-handle-in-minimum-time data
>> (low_latency) and the-userspace-reader-isn't-reading-fast-enough-
>> so-its-ok-to-halt-transmission ?
>
> Because low latency is about *turn around* time. There are plenty of
> protocols that can flow control, do flow control and want low latency
> because they are not windowed. It's not mutually exclusive by any means.

But if it's all about turn around time, how can the situation devolve to
needing throttling in the first place? For N_TTY, throttling only happens
when the read queue is close to overflow (only 128 bytes left in 4k buffer).

If the reader isn't pulling _all_ the data out the instant it's woken,
trying to trim off one worker wakeup (by processing input at interrupt time)
is pointless.

>> But first I'd like some hard data on whether or not a low latency
>> mode is even necessary (at least for user-space).
>
> The easy way to simulate the annoying as crap worst cases from dumbass
> firmware downloaders and the like is to set up a link between two PCs and
> time 2000+ repetitions of
>
> send 64 bytes
> wait for a Y
> send 64 bytes
> wait for a Y
> ....
>
> and the matching far end being a box running an existing kernel or a PIC
> or something doing the responses.

Well this is easy enough to mock up.

> Historically we used to lose about 20mS per cycle which over 2000 got to
> be a bit of a PITA.
>
> Low latency goes back to the days of flip buffers, bottom halves an a
> 100Hz clock. There are certainly better ways to do it now if its needed.

Still, as you've pointed out a couple of times, maybe there's a limited
fastpath that's easy and simple. That's why I was asking about throttling
because it's evaluated even for raw mode.

Regards,
Peter Hurley


2014-02-19 21:43:00

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> > Because low latency is about *turn around* time. There are plenty of
> > protocols that can flow control, do flow control and want low latency
> > because they are not windowed. It's not mutually exclusive by any means.
>
> But if it's all about turn around time, how can the situation devolve to
> needing throttling in the first place? For N_TTY, throttling only happens
> when the read queue is close to overflow (only 128 bytes left in 4k buffer).

Because if we are on the uplink end of a dumb protocol then our flow ends
up as

< 4K in
read
rest of block
read
block complete
write Y

until at some point sods law decrees that we do a big load of disk
writes, an SMM event steals the processor, firefox garbage collects or
somesuch and we go

4K in
throttle
flow control


It only has to happen *once*, not every time to screw stuff up.

> If the reader isn't pulling _all_ the data out the instant it's woken,
> trying to trim off one worker wakeup (by processing input at interrupt time)
> is pointless.

No - because of the statistical corner cases. The standard Linux is not
remotely hard real time, and even if it were most of the tools involved
are not.

> > Low latency goes back to the days of flip buffers, bottom halves an a
> > 100Hz clock. There are certainly better ways to do it now if its needed.
>
> Still, as you've pointed out a couple of times, maybe there's a limited
> fastpath that's easy and simple. That's why I was asking about throttling
> because it's evaluated even for raw mode.

Even if there isn't the goal of low latency was always 'get stuff to
happen soon' not 'get stuff to happen in the IRQ handler' If you have the
tty processing happening immediately after the IRQ returns when
low_latency is set I'm sure the numbers will be just fine and as good as
historically.

Whether you can always do that I guess depends what happens on really
slow boxes if you don't do any deferral - eg on Geert's Amiga.

Flow control even in raw mode is expected Unix behaviour. However do we
ever need to evaluate it if the tty termios has IXON and CRTSCTS clear ?

Alan

2014-02-19 23:16:27

by Hal Murray

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> Can you give me an idea of your device's average and minimum required
> latency (please be specific)? Is your target arch x86 [so I can evaluate the
> the impact of bus-locked instructions relative to your expected]?

The code I'm familiar with is ntpd and gpsd. They run on almost any hardware
or OS and talk to a wide collection of devices.

There is no hard requirement for latency. They just work better with lower
latency. The lower the better.

People gripe about the latency due to USB polling which is about a ms.

I can easily notice a few 10s of microseconds. I probably wouldn't notice a
few microseconds, but there are people who would. The latency isn't
critical, it's the jitter. (ntpd has fudge factors to correct for a fixed
offset.) Yes, down at the microsecond level luck-of-the-cache is important.


> Also, how painful would it be if unsupported termios changes were rejected
> if the port was in low_latency mode and/or if low_latency setting was
> disallowed because of termios state?

What does "unsupported termios changes" mean?

ntpd has only a few places where it opens a serial port. I'll collect a list
of the options that are used if that will help. The common cases are either
raw binary, or lines of text. It doesn't need any fancy editing.


> It would be pointless to throttle low_latency, yes?

What does "throttle" mean? If you mean what I call flow-control, then no,
it's not interesting.

There shouldn't be any problem with ntpd or gpsd grabbing all the data
promptly.


> What would be an acceptable outcome of being unable to accept input?
> Corrupted overrun? Dropped i/o? Queued for later? Please explain with
> comparison to the outcome of missed minimum latency.

Corruption would be evil. Longer latency would be OK, especially if it
didn't happen often. (ntpd has code to discard outliers.) 3% of the time
would probably not be a problem. 25% might cause problems.

We can allocate a bigger buffer if that helps.

--------

gpsd uses TIOCMIWAIT to get a wakeup from a PPS signal connected to a modem
control line. That path might have the same problem and/or some ideas on how
to handle the data case.



--
These are my opinions. I hate spam.


2014-02-19 23:35:34

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> gpsd uses TIOCMIWAIT to get a wakeup from a PPS signal connected to a modem
> control line. That path might have the same problem and/or some ideas on how
> to handle the data case.

TIOCMIWAIT should be fine. The data processing is deferred but very few
devices defer processing of modem change events.

Alan

2014-02-20 02:20:04

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/19/2014 04:42 PM, One Thousand Gnomes wrote:
> It only has to happen *once*, not every time to screw stuff up.
>
>> If the reader isn't pulling _all_ the data out the instant it's woken,
>> trying to trim off one worker wakeup (by processing input at interrupt time)
>> is pointless.
>
> No - because of the statistical corner cases. The standard Linux is not
> remotely hard real time, and even if it were most of the tools involved
> are not.

Ok, so this is still only about "best effort", and really bad
worst case behavior (that the tty core has no control over) is ok.

Going to great lengths to trim one wakeup when nouveau disables interrupts
for 2ms seemed like a waste of time.

>>> Low latency goes back to the days of flip buffers, bottom halves an a
>>> 100Hz clock. There are certainly better ways to do it now if its needed.
>>
>> Still, as you've pointed out a couple of times, maybe there's a limited
>> fastpath that's easy and simple. That's why I was asking about throttling
>> because it's evaluated even for raw mode.
>
> Even if there isn't the goal of low latency was always 'get stuff to
> happen soon' not 'get stuff to happen in the IRQ handler' If you have the
> tty processing happening immediately after the IRQ returns when
> low_latency is set I'm sure the numbers will be just fine and as good as
> historically.
>
> Whether you can always do that I guess depends what happens on really
> slow boxes if you don't do any deferral - eg on Geert's Amiga.

Sure.

I think _some_ effort will yield positive results. For example,
in flush_to_ldisc():

if (disc == NULL)
return;

- mutex_lock(&buf->lock);
+ if (!mutex_trylock(&buf->lock))
+ goto deref;

while (1) {
struct tty_buffer *head = buf->head;

.....

mutex_unlock(&buf->lock);
-
+ deref:
tty_ldisc_deref(disc);
}

This change makes flush_to_ldisc() itself safely callable from
interrupt context, and:
1. doesn't lose data (ie., buffers if the ldisc is filling up)
2. automatically picks the optimum handling whether the input worker
is running or not
3. doesn't require more locks to exclude flushing or the input worker

This still leaves fixing termios lock, properly handling throttling
and the exclusion logic for incompatible termios options and low_latency.

[For this change to work properly, I still need to solve a race where
the driver has just updated the commit marker but can't grab the buf
lock because tty_buffer_flush() still has the lock but has already
performed the flush -- in this case, the work needs to be restarted
or something because there's still data in the buffer that needs
processing. There's plenty of possible solutions to this; I'm thinking on
which is the right one].

> Flow control even in raw mode is expected Unix behaviour. However do we
> ever need to evaluate it if the tty termios has IXON and CRTSCTS clear ?

It doesn't right now because the assumption is that drivers can
have a whole host of reasons why they want to throttle (the pty driver
leaves it on permanently; the vt driver needs unthrottle to support
paste that no one uses even though I fixed it).

Putting aside for a moment the issue of termios safety inside
the throttle and unthrottle driver methods, the exclusion locks here could
be spinlocks if the drivers can be audited/fixed to not sleep here.

Then that just leaves the termios lock, which is a non-trivial problem, and
I'm not convinced RCU will magically fix it.

Regards,
Peter Hurley

2014-02-20 02:55:27

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/19/2014 06:06 PM, Hal Murray wrote:
>> Can you give me an idea of your device's average and minimum required
>> latency (please be specific)? Is your target arch x86 [so I can evaluate the
>> the impact of bus-locked instructions relative to your expected]?
>
> The code I'm familiar with is ntpd and gpsd. They run on almost any hardware
> or OS and talk to a wide collection of devices.
>
> There is no hard requirement for latency. They just work better with lower
> latency. The lower the better.
>
> People gripe about the latency due to USB polling which is about a ms.

Have you tried 3.12+ without low_latency? I ripped out a lot of locks
from 3.12+ so it's possible it already meets your requirements.

> I can easily notice a few 10s of microseconds. I probably wouldn't notice a
> few microseconds, but there are people who would. The latency isn't
> critical, it's the jitter. (ntpd has fudge factors to correct for a fixed
> offset.) Yes, down at the microsecond level luck-of-the-cache is important.

Hopefully you meant "milliseconds" here; single-digit microsecond latency on
any kind of stable duty cycle is linux-rt territory, and simply not a reasonable
goal for mainline.

The jitter is all scheduler and since the user-space app is sleeping waiting
for input, there's nothing the tty core can do about that. Removing one
mandatory scheduling wakeup _may_ help latency, but will probably not make
much difference regarding jitter.

>> Also, how painful would it be if unsupported termios changes were rejected
>> if the port was in low_latency mode and/or if low_latency setting was
>> disallowed because of termios state?
>
> What does "unsupported termios changes" mean?

For example, once the port is in low_latency mode, setting L_ECHO (and its ilk)
would be rejected. And vice versa, if the L_ECHO is set in termios, low_latency
would be rejected.

So running a vt console is low_latency mode is not going to work.

> ntpd has only a few places where it opens a serial port. I'll collect a list
> of the options that are used if that will help. The common cases are either
> raw binary, or lines of text. It doesn't need any fancy editing.
>
>
>> It would be pointless to throttle low_latency, yes?
>
> What does "throttle" mean? If you mean what I call flow-control, then no,
> it's not interesting.

Yes, whatever the driver currently considers flow-control. The core is
agnostic about the mechanism; throttling is the generic requirement.

> There shouldn't be any problem with ntpd or gpsd grabbing all the data
> promptly.

Ok.

>> What would be an acceptable outcome of being unable to accept input?
>> Corrupted overrun? Dropped i/o? Queued for later? Please explain with
>> comparison to the outcome of missed minimum latency.
>
> Corruption would be evil. Longer latency would be OK, especially if it
> didn't happen often. (ntpd has code to discard outliers.) 3% of the time
> would probably not be a problem. 25% might cause problems.
>
> We can allocate a bigger buffer if that helps.

No need, I already solved this step.

> gpsd uses TIOCMIWAIT to get a wakeup from a PPS signal connected to a modem
> control line. That path might have the same problem and/or some ideas on how
> to handle the data case.

What Alan said. low_latency has no impact on the handling of the PPS signal
through DCD.

Regards,
Peter Hurley

2014-02-20 04:14:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On Wed, Feb 19, 2014 at 09:55:21PM -0500, Peter Hurley wrote:
> On 02/19/2014 06:06 PM, Hal Murray wrote:
> >> Can you give me an idea of your device's average and minimum required
> >> latency (please be specific)? Is your target arch x86 [so I can evaluate the
> >> the impact of bus-locked instructions relative to your expected]?
> >
> > The code I'm familiar with is ntpd and gpsd. They run on almost any hardware
> > or OS and talk to a wide collection of devices.
> >
> > There is no hard requirement for latency. They just work better with lower
> > latency. The lower the better.
> >
> > People gripe about the latency due to USB polling which is about a ms.
>
> Have you tried 3.12+ without low_latency? I ripped out a lot of locks
> from 3.12+ so it's possible it already meets your requirements.

Once USB gets involved, I don't want to hear anyone start complaining
about "latency" issues. Almost all USB->serial devices do not take
latency into account at all. The ones that do are really expensive and
not showing up in GPS devices and other "normal" devices at all.

So go blame the device manufactures for this, they obviously don't care
about issues like that if they use USB bulk endpoints for their data.

And yes, there are rumors that new hardware in a few years will be able
to handle time differences with USB latencies and the like, but I'll
wait until I see that hardware ship before even start to worry about the
issues involved...

thanks,

greg k-h

2014-02-20 18:16:45

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/19/2014 09:55 PM, Peter Hurley wrote:
> On 02/19/2014 06:06 PM, Hal Murray wrote:
>>> Can you give me an idea of your device's average and minimum required
>>> latency (please be specific)? Is your target arch x86 [so I can evaluate the
>>> the impact of bus-locked instructions relative to your expected]?
>>
>> The code I'm familiar with is ntpd and gpsd. They run on almost any hardware
>> or OS and talk to a wide collection of devices.
>>
>> There is no hard requirement for latency. They just work better with lower
>> latency. The lower the better.
>>
>> People gripe about the latency due to USB polling which is about a ms.
>
> Have you tried 3.12+ without low_latency? I ripped out a lot of locks
> from 3.12+ so it's possible it already meets your requirements.

Using Alan's idea to mock up a latency test, I threw together a test jig
using two computers running 3.14-rc1 and my fwserial driver (modified to
not aggregrate writes) in raw mode where the target does this:

while (1) {
read 64 bytes
compare to pattern
write 1 byte response
}

and the sender does this:

for (i = 0; i < 2000; i++) {
write 64-byte pattern
read 1 byte response
}

Sender completes 2000 loops in 160ms total run time;
that's 80us average per complete round-trip.

Here's a snapshot of a function trace for 1 complete round trip at the target:

<idle>-0 [007] d.h1 4856.935561: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req
<idle>-0 [007] ..s1 4856.935566: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2a, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0
<idle>-0 [007] ..s1 4856.935567: tty_flip_buffer_push <-fwtty_port_handler
kworker/7:0-6872 [007] .... 4856.935571: flush_to_ldisc <-process_one_work
tty-latency-tes-6891 [003] .... 4856.935578: n_tty_write <-tty_write
<idle>-0 [003] d.h1 4856.935591: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req
<idle>-0 [003] ..s1 4856.935594: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 12, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0

The same snapshot explained:

The firewire controller IRQ for received packet in rx dma fifo:

<idle>-0 [007] d.h1 4856.935561: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req

The firewire-ohci driver rx fifo tasklet running:

<idle>-0 [007] ..s1 4856.935566: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2a, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0

The fwserial driver schdeduling the rx data:

<idle>-0 [007] ..s1 4856.935567: tty_flip_buffer_push <-fwtty_port_handler

The tty buffer input worker running:

kworker/7:0-6872 [007] .... 4856.935571: flush_to_ldisc <-process_one_work

The test application having received the data, compared it,
and writing the 1-byte response:

tty-latency-tes-6891 [003] .... 4856.935578: n_tty_write <-tty_write

The firewire controller IRQ for running the tx dma fifo:

<idle>-0 [003] d.h1 4856.935591: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req

The firewire-ohci driver tx fifo tasklet acknowledging the sender
of the 64-byte packet has received the 1-byte response:

<idle>-0 [003] ..s1 4856.935594: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 12, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0


... trace continued for another 4x 64-byte reads and 1 byte responses:

<idle>-0 [007] d.h1 4856.935646: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req
<idle>-0 [007] ..s1 4856.935650: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2b, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0
<idle>-0 [007] ..s1 4856.935651: tty_flip_buffer_push <-fwtty_port_handler
kworker/7:0-6872 [007] .... 4856.935655: flush_to_ldisc <-process_one_work
tty-latency-tes-6891 [003] .... 4856.935662: n_tty_write <-tty_write
<idle>-0 [003] d.h1 4856.935674: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req
<idle>-0 [003] ..s1 4856.935678: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 13, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0
<idle>-0 [007] d.h1 4856.935729: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req
<idle>-0 [007] ..s1 4856.935733: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2c, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0
<idle>-0 [007] ..s1 4856.935734: tty_flip_buffer_push <-fwtty_port_handler
kworker/7:0-6872 [007] .... 4856.935738: flush_to_ldisc <-process_one_work
tty-latency-tes-6891 [003] .... 4856.935746: n_tty_write <-tty_write
<idle>-0 [003] d.h1 4856.935758: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req
<idle>-0 [003] ..s1 4856.935762: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 14, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0
<idle>-0 [007] d.h1 4856.935812: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req
<idle>-0 [007] ..s1 4856.935816: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2d, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0
<idle>-0 [007] ..s1 4856.935817: tty_flip_buffer_push <-fwtty_port_handler
kworker/7:0-6872 [007] .... 4856.935821: flush_to_ldisc <-process_one_work
tty-latency-tes-6891 [003] .... 4856.935828: n_tty_write <-tty_write
<idle>-0 [003] d.h1 4856.935844: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req
<idle>-0 [003] ..s1 4856.935847: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 15, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0
<idle>-0 [007] d.h1 4856.935895: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000010 AR_req
<idle>-0 [007] ..s1 4856.935899: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AR spd 2 tl 2e, ffc1 -> ffc0, ack_complete, BW req, 800000000004 40,0
<idle>-0 [007] ..s1 4856.935900: tty_flip_buffer_push <-fwtty_port_handler
kworker/7:0-6872 [007] .... 4856.935904: flush_to_ldisc <-process_one_work
tty-latency-tes-6891 [003] .... 4856.935911: n_tty_write <-tty_write
<idle>-0 [003] d.h1 4856.935923: irq_handler: firewire_ohci 0000:07:06.0: IRQ 00000001 AT_req
<idle>-0 [003] ..s1 4856.935927: log_ar_at_event.isra.33: firewire_ohci 0000:07:06.0: AT spd 2 tl 16, ffc0 -> ffc1, ack_complete, BW req, 800000000004 1,0


I think this shows that low_latency is unnecessary and should
just be removed/ignored by the tty core.

Let me know if you want my test jig (or fwserial patch). The test
was run on a 10-year old Vaio single-core laptop as the sender and
6-year old dual socket xeon workstation (so nothing super tricked
out for this test).


Regards,
Peter Hurley

2014-02-20 19:34:18

by Grant Edwards

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 2014-02-20, Peter Hurley <[email protected]> wrote:
> On 02/19/2014 09:55 PM, Peter Hurley wrote:
>> On 02/19/2014 06:06 PM, Hal Murray wrote:
>>>> Can you give me an idea of your device's average and minimum required
>>>> latency (please be specific)? Is your target arch x86 [so I can evaluate the
>>>> the impact of bus-locked instructions relative to your expected]?
>>>
>>> The code I'm familiar with is ntpd and gpsd. They run on almost any hardware
>>> or OS and talk to a wide collection of devices.
>>>
>>> There is no hard requirement for latency. They just work better with lower
>>> latency. The lower the better.
>>>
>>> People gripe about the latency due to USB polling which is about a ms.
>>
>> Have you tried 3.12+ without low_latency? I ripped out a lot of locks
>> from 3.12+ so it's possible it already meets your requirements.
>
> Using Alan's idea to mock up a latency test, I threw together a test jig
> using two computers running 3.14-rc1 and my fwserial driver (modified to
> not aggregrate writes) in raw mode where the target does this:
>
> while (1) {
> read 64 bytes
> compare to pattern
> write 1 byte response
> }
>
> and the sender does this:
>
> for (i = 0; i < 2000; i++) {
> write 64-byte pattern
> read 1 byte response
> }
>
> Sender completes 2000 loops in 160ms total run time;
> that's 80us average per complete round-trip.

If I understand correctly, that 80us _includes_ the actual time for
the bits on the wire (which means the actual "baud rate" involved is
high enough that it's negligible).


> I think this shows that low_latency is unnecessary and should
> just be removed/ignored by the tty core.

If that's the sort of latency that you get for typical kernel
configurations for typical distros, then I agree that the low_latency
flag is not needed by the tty later.

However, it might still be useful for the lower-level tty or
serial-core driver to control CPU usage vs. latency trade-offs (for
exaple, one of my drivers uses it to decide where to set the rx FIFO
threshold).

--
Grant Edwards grant.b.edwards Yow! I wonder if I could
at ever get started in the
gmail.com credit world?

2014-02-20 21:55:44

by Hal Murray

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature


> Have you tried 3.12+ without low_latency? I ripped out a lot of locks from
> 3.12+ so it's possible it already meets your requirements.

Looks good. I don't think I could tell the difference by looking at my
normal collection of graphs.


> Hopefully you meant "milliseconds" here; single-digit microsecond latency on
> any kind of stable duty cycle is linux-rt territory, and simply not a
> reasonable goal for mainline.

No, I really meant microseconds. Remember, I'm running on a lightly loaded
system, not trying to squeeze the impossible out of an overloaded system.

top says gpsd has niced itself to -10, and ntpd is marked RT.

100 microseconds is easy. I can get down to a few 10s of microseconds. I'm
not sure how low I could get.


>> What does "unsupported termios changes" mean?

> For example, once the port is in low_latency mode, setting L_ECHO (and its
> ilk) would be rejected. And vice versa, if the L_ECHO is set in termios,
> low_latency would be rejected.

> So running a vt console is low_latency mode is not going to work.

OK. I doubt if there is any problem, but we should be sure to be explicit
about what "and its ilk" really means.

------------

I don't understand the scheduler issues that triggered this bug.

Let's go back to the big picture. In the old old days, time sharing systems
had lots of serial ports. It was common for the hardware to buffer up
several characters before requesting an interrupt in order to reduce the CPU
load. There was usually a bit in the hardware to bypass this if you thought
that response time was more important than CPU load. I was expecting
low_latency to set that bit.

Is that option even present in modern serial chips? Do the various chips
claiming to be 8250/16550 and friends correctly implement all the details of
the specs?

Many gigabit ethernet controllers have the same issue. It's often called
interrupt coalescing.

What/why is the serial/scheduler doing differently in the low_latency case?
What case does that help?



--
These are my opinions. I hate spam.


2014-02-20 22:06:15

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 02/20/2014 02:33 PM, Grant Edwards wrote:
> On 2014-02-20, Peter Hurley <[email protected]> wrote:
>> Sender completes 2000 loops in 160ms total run time;
>> that's 80us average per complete round-trip.
>
> If I understand correctly, that 80us _includes_ the actual time for
> the bits on the wire (which means the actual "baud rate" involved is
> high enough that it's negligible).

Yes, 80us includes the transmit time.

>> I think this shows that low_latency is unnecessary and should
>> just be removed/ignored by the tty core.
>
> If that's the sort of latency that you get for typical kernel
> configurations for typical distros, then I agree that the low_latency
> flag is not needed by the tty later.

Stock ubuntu kernel config but preempt and 250hz (and debugging stuff).

> However, it might still be useful for the lower-level tty or
> serial-core driver to control CPU usage vs. latency trade-offs (for
> exaple, one of my drivers uses it to decide where to set the rx FIFO
> threshold).

Sure, it could be left for driver consumption.

Regards,
Peter Hurley

2014-02-20 22:15:23

by Grant Edwards

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 2014-02-20, Hal Murray <[email protected]> wrote:

> Let's go back to the big picture. In the old old days, time sharing
> systems had lots of serial ports. It was common for the hardware to
> buffer up several characters before requesting an interrupt in order
> to reduce the CPU load.

There were even serial boards that had a cooked "line mode" which
buffered up a whole line of input: they handled limited line-editing
and didn't interrupt the CPU until they saw 'enter' or 'ctrl-C'.

> There was usually a bit in the hardware to bypass this if you thought
> that response time was more important than CPU load. I was expecting
> low_latency to set that bit.

It might. That depends on whether the driver paid any attention to
the low_latency flag. IIRC, some did, some didn't.

> Is that option even present in modern serial chips?

Sure. In pretty much all of the UARTs I know of, you can configure
the rx FIFO threshold or disable the rx FIFO altogether [though
setting the threshold to 1 is usually a better idea than disabling the
rx FIFO]. At least one of my serial_core drivers looks at the
low_latency flag and configure a lower rx FIFO threshold if it's set.

> Do the various chips claiming to be 8250/16550 and friends correctly
> implement all the details of the specs?

What specs?

> Many gigabit ethernet controllers have the same issue. It's often
> called interrupt coalescing.
>
> What/why is the serial/scheduler doing differently in the low_latency
> case? What case does that help?

Back in the old days, when a serial driver pushed characters up to the
tty layer it didn't immediately wake up a process that was blocking on
a read(). AFAICT, that didn't happen until the next system tick. I'm
not sure if that was just because the scheduler wasn't called until a
tick happened, or if there was some intermediate tty-layer
worker-thread that had to run.

Setting the low_latency flag avoided that.

When the driver pushed characters to the tty layer with the
low_latency flag set, the user-space process that was blocking on
read() would wake up "immediately". This potentially used up a lot
more CPU time, since a user process that is reading a large block of
data _might_ be woken up and then block again for every rx byte --
assuming no rx FIFO. Without the low_latency flag, the user process
would wake up every 10ms and be handed 10ms worth of data. (Back then
HZ was always 100.)

At least that's how I remember it...

--
Grant Edwards grant.b.edwards Yow! My EARS are GONE!!
at
gmail.com

2014-02-21 15:40:05

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> Ok, so this is still only about "best effort", and really bad
> worst case behavior (that the tty core has no control over) is ok.
>
> Going to great lengths to trim one wakeup when nouveau disables interrupts
> for 2ms seemed like a waste of time.

If it used to work and it doesn't now it's a regression. It's also a
nasty one if you've removed the facility for it.

> This change makes flush_to_ldisc() itself safely callable from
> interrupt context, and:
> 1. doesn't lose data (ie., buffers if the ldisc is filling up)
> 2. automatically picks the optimum handling whether the input worker
> is running or not
> 3. doesn't require more locks to exclude flushing or the input worker

Yep

> Putting aside for a moment the issue of termios safety inside
> the throttle and unthrottle driver methods, the exclusion locks here could
> be spinlocks if the drivers can be audited/fixed to not sleep here.

That was basically insoluble when the lock first went in. We tried with a
spinlock but a lot of USB widgets need to go and chatter with the device
when you do flow control. Flow control is fundamentally ordered but
asynchronous however so if the right fix was to make the USB dongles
queue the work then no harm is done (and the queued flow control
assertion would worst case be no different to a non queued one from a
queued flush_to_ldisc)

> Then that just leaves the termios lock, which is a non-trivial problem, and
> I'm not convinced RCU will magically fix it.

If you pass a snapshot of the termios state down then I think it does,
but it's still not remotely trivial.

First question though comes before all of this - and that is do we need
low_latency at all any more or is the current scheduling logic now good
enough to do the job anyway.

Alan

2014-02-21 15:44:09

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> Back in the old days, when a serial driver pushed characters up to the
> tty layer it didn't immediately wake up a process that was blocking on
> a read(). AFAICT, that didn't happen until the next system tick. I'm
> not sure if that was just because the scheduler wasn't called until a
> tick happened, or if there was some intermediate tty-layer
> worker-thread that had to run.

Historically the interrupt handler tried to get out of the interrupt as
fast as possible and to do the minimum instruction count, because a 56K
modem on a 486 with a typical 16450A UART was *tight* and also a use case
everyone cared about.

So historically the code worked on the basis that there were two buffers
per tty. Each timer tick the kernel flipped the buffers over (hence the
legancy flip naming here and thre), and processed the data.

We do need low latency to the drivers, for FIFO setting, DMA watermarks
and for some USB dongles for configuring the packetising

Alan

2014-02-21 15:58:21

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

Hi Alan,

On 02/21/2014 10:39 AM, One Thousand Gnomes wrote:
>> Ok, so this is still only about "best effort", and really bad
>> worst case behavior (that the tty core has no control over) is ok.
>>
>> Going to great lengths to trim one wakeup when nouveau disables interrupts
>> for 2ms seemed like a waste of time.
>
> If it used to work and it doesn't now it's a regression. It's also a
> nasty one if you've removed the facility for it.

I think the consensus is to leave the low_latency facility in, but
remove it's connection to the tty buffers.

If the known-to-be-already-in-non-interrupt-context drivers want,
I can add a different function for executing flush_to_ldisc()
directly. But I don't want to do that without a use-case and test
subject.

>> This change makes flush_to_ldisc() itself safely callable from
>> interrupt context, and:
>> 1. doesn't lose data (ie., buffers if the ldisc is filling up)
>> 2. automatically picks the optimum handling whether the input worker
>> is running or not
>> 3. doesn't require more locks to exclude flushing or the input worker
>
> Yep
>
>> Putting aside for a moment the issue of termios safety inside
>> the throttle and unthrottle driver methods, the exclusion locks here could
>> be spinlocks if the drivers can be audited/fixed to not sleep here.
>
> That was basically insoluble when the lock first went in. We tried with a
> spinlock but a lot of USB widgets need to go and chatter with the device
> when you do flow control. Flow control is fundamentally ordered but
> asynchronous however so if the right fix was to make the USB dongles
> queue the work then no harm is done (and the queued flow control
> assertion would worst case be no different to a non queued one from a
> queued flush_to_ldisc)

Oh. That's not something I want to take on.

>> Then that just leaves the termios lock, which is a non-trivial problem, and
>> I'm not convinced RCU will magically fix it.
>
> If you pass a snapshot of the termios state down then I think it does,
> but it's still not remotely trivial.

That was my thought too -- that only dependency injection would work.
Which would require adding that to most, if not all, driver methods, which
seems way too painful.

> First question though comes before all of this - and that is do we need
> low_latency at all any more or is the current scheduling logic now good
> enough to do the job anyway.

Right.

Based on my recent test, I think low_latency doesn't need to be a knob for
the tty core. Drivers can continue to use it to mess with their rx fifo
settings and such like.

I plan on sending Greg a patch to do just that, probably this weekend.

Regards,
Peter Hurley

2014-02-21 16:31:35

by Grant Edwards

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On 2014-02-21, Peter Hurley <[email protected]> wrote:

> I think the consensus is to leave the low_latency facility in, but
> remove it's connection to the tty buffers.
>
> If the known-to-be-already-in-non-interrupt-context drivers want,
> I can add a different function for executing flush_to_ldisc()
> directly. But I don't want to do that without a use-case and test
> subject.

Three of the drivers I maintain have modes where they handle all rx
data in non-interrupt contexts, but I'm not convinced (or even
suspicious) that there would be any noticeable benefit from such a
function. If, at some point in the future, it becomes apparent that
there is "too much latency" in certain cases then perhaps it can be
looked at again -- but I think doing it now is premature optimization.
That said, all things being equal, it would be nice to avoid anything
that would make such an addition impossible in the future.

>> First question though comes before all of this - and that is do we need
>> low_latency at all any more or is the current scheduling logic now good
>> enough to do the job anyway.
>
> Right.
>
> Based on my recent test, I think low_latency doesn't need to be a
> knob for the tty core.

I Agree: there doesn't seem to be any evidence that it's needed by the
tty/ldisc layer.

> Drivers can continue to use it to mess with their rx fifo settings
> and such like.

Excellent. One of my serial_core drivers still has (in it's default
configuration) 10ms of latency that I can choose to eliminate on a
per-port bases (at the cost of extra CPU cycles) when the low_latency
flag is set.

> I plan on sending Greg a patch to do just that, probably this weekend.

Cool. Thanks much for your attention to this.

--
Grant

2014-02-23 22:33:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

On Thu, 20 Feb 2014, Peter Hurley wrote:

> On 02/19/2014 09:55 PM, Peter Hurley wrote:
> > On 02/19/2014 06:06 PM, Hal Murray wrote:
> > > > Can you give me an idea of your device's average and minimum required
> > > > latency (please be specific)? Is your target arch x86 [so I can
> > > > evaluate the
> > > > the impact of bus-locked instructions relative to your expected]?
> > >
> > > The code I'm familiar with is ntpd and gpsd. They run on almost any
> > > hardware
> > > or OS and talk to a wide collection of devices.
> > >
> > > There is no hard requirement for latency. They just work better with
> > > lower
> > > latency. The lower the better.
> > >
> > > People gripe about the latency due to USB polling which is about a ms.
> >
> > Have you tried 3.12+ without low_latency? I ripped out a lot of locks
> > from 3.12+ so it's possible it already meets your requirements.
>
> Using Alan's idea to mock up a latency test, I threw together a test jig
> using two computers running 3.14-rc1 and my fwserial driver (modified to
> not aggregrate writes) in raw mode where the target does this:

This is a complete pointless test. Use a bog standard 8250 UART on the
PC and connect a microcontroller on the other end which serves you an
continous stream of data at 115200 Baud.

There is no way you can keep up with that without the low latency
option neither on old and nor on new machines if you have enough other
stuff going on in the system.

Thanks,

tglx


2014-02-24 00:23:35

by Peter Hurley

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

Hi Thomas,

On 02/23/2014 05:33 PM, Thomas Gleixner wrote:
> On Thu, 20 Feb 2014, Peter Hurley wrote:
>
>> On 02/19/2014 09:55 PM, Peter Hurley wrote:
>>> On 02/19/2014 06:06 PM, Hal Murray wrote:
>>>>> Can you give me an idea of your device's average and minimum required
>>>>> latency (please be specific)? Is your target arch x86 [so I can
>>>>> evaluate the
>>>>> the impact of bus-locked instructions relative to your expected]?
>>>>
>>>> The code I'm familiar with is ntpd and gpsd. They run on almost any
>>>> hardware
>>>> or OS and talk to a wide collection of devices.
>>>>
>>>> There is no hard requirement for latency. They just work better with
>>>> lower
>>>> latency. The lower the better.
>>>>
>>>> People gripe about the latency due to USB polling which is about a ms.
>>>
>>> Have you tried 3.12+ without low_latency? I ripped out a lot of locks
>>> from 3.12+ so it's possible it already meets your requirements.
>>
>> Using Alan's idea to mock up a latency test, I threw together a test jig
>> using two computers running 3.14-rc1 and my fwserial driver (modified to
>> not aggregrate writes) in raw mode where the target does this:
>
> This is a complete pointless test. Use a bog standard 8250 UART on the
> PC and connect a microcontroller on the other end which serves you an
> continous stream of data at 115200 Baud.
>
> There is no way you can keep up with that without the low latency
> option neither on old and nor on new machines if you have enough other
> stuff going on in the system.

I'm not sure exactly what you mean by 'keep up'?
115kbaud is 11.25KB/sec which is a trivial workload (unless you're
using a 1-byte read buffer).

If you have enough other stuff going on in the system (hackbench?),
even the low_latency knob won't fix the inability to keep up because all
the buffering will fill up and overrun anyway.

So what I need to understand about your setup is:
a) is throughput the actual problem or is latency? IOW, does the
device have a minimum response time from a user-space process
or is buffered data getting dropped?
b) is the device flow-controlled or is that not an option?

Based on those answers, if necessary, I could get you an instrumentation
patch, if your willing, so I can profile where the problem is.

And I haven't seen a bog standard 8250 UART in 3 decades. What UART
do you actually have?

Regards,
Peter Hurley


2014-02-24 13:24:22

by Alan Cox

[permalink] [raw]
Subject: Re: locking changes in tty broke low latency feature

> This is a complete pointless test. Use a bog standard 8250 UART on the
> PC and connect a microcontroller on the other end which serves you an
> continous stream of data at 115200 Baud.
>
> There is no way you can keep up with that without the low latency
> option neither on old and nor on new machines if you have enough other
> stuff going on in the system.

Sorry but having done this in the past the reverse is true. On ancient
machines with crap uarts the low_latency case would routinely overrun
while the non low_latency case did not. That was half of the point of
deferred processing - it pushed tty processing out of the IRQ handler so
bytes were not lost and a 486DX could cope with a 56Kbit modem.

Alan