2015-05-09 10:22:56

by Mason

Subject: Hardware spec prevents optimal performance in device driver

Hello everyone,

I'm writing a device driver for a serial-ish kind of device.
I'm interested in the TX side of the problem. (I'm working on
an ARM Cortex A9 system by the way.)

There's a 16-byte TX FIFO. Data is queued to the FIFO by writing
{1,2,4} bytes to a TX{8,16,32} memory-mapped register.
Reading the TX_DEPTH register returns the current queue depth.

The TX_READY IRQ is asserted when (and only when) TX_DEPTH
transitions from 1 to 0.

With this spec in mind, I don't see how it is possible to
attain optimal TX performance in the driver. There's a race
between the SW thread filling the queue and the HW thread
emptying it.

My first attempt went along these lines:

SW thread pseudo-code (blocking write)

    while (bytes_to_send > 16) {
        write 16 bytes to the queue    /* NON-ATOMIC */
        bytes_to_send -= 16;
        wait for semaphore
    }
    write the last bytes to the queue
    wait for semaphore
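
For concreteness, here is roughly what I mean in C. This is only a
sketch: struct my_dev and its fields are placeholder names, and I
assume the TX_READY ISR signals the tx_done completion.

    /* Sketch only: assumes <linux/io.h> and <linux/completion.h>;
     * dev->tx8 is the ioremapped TX8 register, dev->tx_done a
     * completion signalled from the TX_READY ISR; len > 0. */
    static void blocking_write(struct my_dev *dev, const u8 *buf, size_t len)
    {
        int i;

        while (len > 16) {
            for (i = 0; i < 16; i++)            /* NON-ATOMIC fill */
                writeb(buf[i], dev->tx8);
            buf += 16;
            len -= 16;
            wait_for_completion(&dev->tx_done); /* TX_READY fired */
        }
        while (len) {                           /* last 1..16 bytes */
            writeb(*buf++, dev->tx8);
            len--;
        }
        wait_for_completion(&dev->tx_done);
    }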

The simplest way to "write 16 bytes to the queue" is
a byte-access loop:

    for (i = 0; i < 16; ++i) write buf[i] to TX8

or, slightly more complex:

    for (i = 0; i < 4; ++i) write buf[4i .. 4i+3] to TX32
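
In actual C, the word variant would be something like this (again a
sketch; it assumes buf is 4-byte aligned and that TX32 takes the
four bytes in the order the device expects):

    for (i = 0; i < 4; i++)    /* 4 word writes = 16 bytes */
        writel(*(const u32 *)(buf + 4 * i), dev->tx32);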

But you see the problem: I write a byte, and then, for some
reason (low frequency from cpufreq, an IRQ), the CPU takes a
very long time to get to the next write, so TX_READY fires
before I even write the next byte.

In short, TX_READY could fire at any point while filling the queue.

In my opinion, the semantics of TX_READY are fuzzy. When I hit
the ISR, all I know is that "the TX queue reached 0 at some point
in time", but the HW might still be busy sending some bytes.

Seems the best one can do is:

    while (bytes_to_send >= 4) {
        write 4 bytes to TX32    /* ATOMIC */
        bytes_to_send -= 4;
        wait for semaphore
    }
    while (bytes_to_send > 0) {
        write 1 byte to TX8      /* ATOMIC */
        bytes_to_send -= 1;
        wait for semaphore
    }
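
In C, with the same placeholder names as before (and glossing over
the alignment of buf for now):

    static void blocking_write_atomic(struct my_dev *dev,
                                      const u8 *buf, size_t len)
    {
        while (len >= 4) {
            writel(*(const u32 *)buf, dev->tx32);  /* ATOMIC word */
            buf += 4;
            len -= 4;
            wait_for_completion(&dev->tx_done);    /* one IRQ per word */
        }
        while (len) {
            writeb(*buf++, dev->tx8);              /* ATOMIC byte */
            len--;
            wait_for_completion(&dev->tx_done);    /* one IRQ per byte */
        }
    }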

(This is ignoring the fact that the original buffer to
send may not be word-aligned; I will have to investigate
misaligned loads, or handle the first 0-3 bytes manually.)
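
Handling the head manually would just mean a few byte writes up
front, something like this (sketch):

    /* Send single bytes until buf is word-aligned, then the
     * TX32 loop above can take over. */
    while (len && ((uintptr_t)buf & 3)) {
        writeb(*buf++, dev->tx8);
        len--;
        wait_for_completion(&dev->tx_done);
    }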

In the solution proposed above, using atomic writes to
the device, I know that TX_READY signals "the work you
requested is now complete". But I have sacrificed
performance, as I will take an IRQ for every 4 bytes,
instead of one for every 16 bytes.

Is this making any sense? Or am I completely mistaken?

Regards.


2015-05-09 17:33:12

by Alan Cox

Subject: Re: Hardware spec prevents optimal performance in device driver

On Sat, 09 May 2015 12:22:43 +0200
Mason <[email protected]> wrote:

> Hello everyone,
>
> I'm writing a device driver for a serial-ish kind of device.
> I'm interested in the TX side of the problem. (I'm working on
> an ARM Cortex A9 system by the way.)
>
> There's a 16-byte TX FIFO. Data is queued to the FIFO by writing
> {1,2,4} bytes to a TX{8,16,32} memory-mapped register.
> Reading the TX_DEPTH register returns the current queue depth.
>
> The TX_READY IRQ is asserted when (and only when) TX_DEPTH
> transitions from 1 to 0.

If the last statement is correct then your performance is probably always
going to suck unless there is additional invisible queueing beyond the
visible FIFO.

FIFOs on sane serial ports either have an adjustable threshold or
fire while they are still some way off empty. That way the normal
flow is that you take the TX interrupt before the port empties, so
you can fill it back up.

On that kind of port I'd expect optimal to be something like
writing 4 bytes at a time until < 4 bytes of space is left,
repeating that until your own transmit queue is < 4 bytes, and
then writing the dribble.
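
Roughly, in C (a sketch only: struct my_dev, fifo_space() and the
tx_buf/tx_len bookkeeping are made-up names, with fifo_space()
derived from your TX_DEPTH register):

    /* One refill pass, e.g. from the TX interrupt handler. */
    static void refill_fifo(struct my_dev *dev)
    {
        while (dev->tx_len >= 4 && fifo_space(dev) >= 4) {
            writel(*(const u32 *)dev->tx_buf, dev->tx32);
            dev->tx_buf += 4;
            dev->tx_len -= 4;
        }
        if (dev->tx_len < 4)                     /* the dribble */
            while (dev->tx_len && fifo_space(dev)) {
                writeb(*dev->tx_buf++, dev->tx8);
                dev->tx_len--;
            }
    }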

You don't normally want to perfectly fill the FIFO; you just want
to ram stuff into it efficiently, with sufficient hardware queueing
and latency of response that the queue never empties. Beyond that
it doesn't matter.

Alan

2015-05-09 20:49:13

by Mason

Subject: Re: Hardware spec prevents optimal performance in device driver

One Thousand Gnomes wrote:

> Mason wrote:
>
>> I'm writing a device driver for a serial-ish kind of device.
>> I'm interested in the TX side of the problem. (I'm working on
>> an ARM Cortex A9 system by the way.)
>>
>> There's a 16-byte TX FIFO. Data is queued to the FIFO by writing
>> {1,2,4} bytes to a TX{8,16,32} memory-mapped register.
>> Reading the TX_DEPTH register returns the current queue depth.
>>
>> The TX_READY IRQ is asserted when (and only when) TX_DEPTH
>> transitions from 1 to 0.
>
> If the last statement is correct then your performance is probably always
> going to suck unless there is additional invisible queueing beyond the
> visible FIFO.

Do you agree with my assessment that the current semantics for
TX_READY lead to a race condition, unless we limit ourselves
to a single (atomic) write between interrupts?

> FIFOs on sane serial ports either have an adjustable threshold or
> fire while they are still some way off empty. That way the normal
> flow is that you take the TX interrupt before the port empties, so
> you can fill it back up.

This is where I must be missing something obvious.

As far as I can see, the race condition still exists, even if
the hardware provides a TX threshold.

Suppose we set the threshold to 4, then write 4-byte words to the queue.
TX_READY may fire between two writes if the CPU is very slow
(unlikely) or is required to do something else (more likely).

Thus in the ISR, I can't tell exactly what happened, and I cannot
signal something clear to the other thread.

What am I missing?

BTW, I checked the HW spec. There's an RX threshold, but no TX threshold.

> On that kind of port I'd expect optimal to be something like
> writing 4 bytes at a time until < 4 bytes of space is left,
> repeating that until your own transmit queue is < 4 bytes, and
> then writing the dribble.

Agreed, to keep the data flowing between the FIFO and the device.

> You don't normally want to perfectly fill the FIFO; you just want
> to ram stuff into it efficiently, with sufficient hardware queueing
> and latency of response that the queue never empties. Beyond that
> it doesn't matter.

Well, there's another dimension to optimize: minimizing IRQs to
the CPU. And completely filling the FIFO achieves that.

Interrupting once for every 12 bytes (with a threshold of 4 on a
16-byte FIFO, each IRQ lets me top up 16 - 4 = 12 bytes) sounds
better than interrupting once for every 4 or 8 bytes, don't you
agree? What am I missing?

Regards.

2015-05-10 10:37:07

by Måns Rullgård

Subject: Re: Hardware spec prevents optimal performance in device driver

Mason <[email protected]> writes:

> One Thousand Gnomes wrote:
>
>> Mason wrote:
>>
>>> I'm writing a device driver for a serial-ish kind of device.
>>> I'm interested in the TX side of the problem. (I'm working on
>>> an ARM Cortex A9 system by the way.)
>>>
>>> There's a 16-byte TX FIFO. Data is queued to the FIFO by writing
>>> {1,2,4} bytes to a TX{8,16,32} memory-mapped register.
>>> Reading the TX_DEPTH register returns the current queue depth.
>>>
>>> The TX_READY IRQ is asserted when (and only when) TX_DEPTH
>>> transitions from 1 to 0.
>>
>> If the last statement is correct then your performance is probably always
>> going to suck unless there is additional invisible queueing beyond the
>> visible FIFO.
>
> Do you agree with my assessment that the current semantics for
> TX_READY lead to a race condition, unless we limit ourselves
> to a single (atomic) write between interrupts?

No. To get best throughput, you can simply busy-wait until TX_DEPTH
indicates the FIFO is almost empty, then write a few words, but no more
than you know fit in the FIFO. Repeat until all data has been written.
Use the IRQ only to signal completion of the entire packet.
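
Something like this, roughly (a sketch: FIFO_SIZE = 16, and
tx_depth/tx32/tx8 stand for mappings of the registers you
described):

    static void busy_wait_tx(struct my_dev *dev, const u8 *buf, size_t len)
    {
        while (len >= 4) {
            /* room left in the FIFO, per TX_DEPTH */
            u32 space = FIFO_SIZE - readl(dev->tx_depth);

            while (space >= 4 && len >= 4) {
                writel(*(const u32 *)buf, dev->tx32);
                buf += 4;
                len -= 4;
                space -= 4;
            }
        }
        while (len) {                       /* trailing 0-3 bytes */
            while (readl(dev->tx_depth) == FIFO_SIZE)
                cpu_relax();                /* wait for room */
            writeb(*buf++, dev->tx8);
            len--;
        }
        /* TX_READY now only signals end of packet */
    }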

If the transmit rate is low, you can save some CPU time by filling the
FIFO, then sleeping until it should be almost empty, fill again, etc.
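
That is (fill_fifo() being a helper like the inner loop above, and
drain_us derived from the bit rate; both made up):

    fill_fifo(dev, &buf, &len);
    while (len) {
        /* sleep for most of the FIFO drain time, wake a bit early */
        usleep_range(drain_us * 3 / 4, drain_us);
        fill_fifo(dev, &buf, &len);
    }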

Whether busy-waiting or sleeping, this approach keeps the data flowing
as fast as possible.

With the hardware you describe, there is unfortunately a trade-off
between throughput and CPU efficiency. You'll have to decide which is
more important to you.

--
Måns Rullgård
[email protected]

2015-05-10 16:46:16

by Mason

Subject: Re: Hardware spec prevents optimal performance in device driver

On 10/05/2015 12:29, Måns Rullgård wrote:

> Mason writes:
>
>> One Thousand Gnomes wrote:
>>
>>> Mason wrote:
>>>
>>>> I'm writing a device driver for a serial-ish kind of device.
>>>> I'm interested in the TX side of the problem. (I'm working on
>>>> an ARM Cortex A9 system by the way.)
>>>>
>>>> There's a 16-byte TX FIFO. Data is queued to the FIFO by writing
>>>> {1,2,4} bytes to a TX{8,16,32} memory-mapped register.
>>>> Reading the TX_DEPTH register returns the current queue depth.
>>>>
>>>> The TX_READY IRQ is asserted when (and only when) TX_DEPTH
>>>> transitions from 1 to 0.
>>>
>>> If the last statement is correct then your performance is probably always
>>> going to suck unless there is additional invisible queueing beyond the
>>> visible FIFO.
>>
>> Do you agree with my assessment that the current semantics for
>> TX_READY lead to a race condition, unless we limit ourselves
>> to a single (atomic) write between interrupts?
>
> No. To get best throughput, you can simply busy-wait until TX_DEPTH
> indicates the FIFO is almost empty, then write a few words, but no more
> than you know fit in the FIFO. Repeat until all data has been written.
> Use the IRQ only to signal completion of the entire packet.

Would you fill the FIFO with TX_READY disabled, or with all
interrupts masked?

I will show with pseudo-code where (I think) the race condition
breaks the algorithm you suggest, when using IRQs rather than
busy-waiting.

> If the transmit rate is low, you can save some CPU time by filling the
> FIFO, then sleeping until it should be almost empty, fill again, etc.

For one data point, the test app I have sets the TX rate to
128 kbps. The 16-byte FIFO holds 128 bits, so it drains in 1 ms.
The CPU runs at 100-1000 MHz, depending on the mood of cpufreq.

> Whether busy-waiting or sleeping, this approach keeps the data flowing
> as fast as possible.
>
> With the hardware you describe, there is unfortunately a trade-off
> between throughput and CPU efficiency. You'll have to decide which is
> more important to you.

I can ask the hardware designer to change the behavior for the next
iteration of the SoC.

Regards.