2002-04-06 19:54:39

by Paul Davis

Subject: delayed interrupt processing caused by cswitching/pipe_writes?

I am trying to track down some latency-related issues that we are
having with a low latency audio system (JACK, the Jack Audio
Connection Kit, http://jackit.sf.net/).

It's the usual "soft RT" thing (SCHED_FIFO, mlockall), except that it
involves pipe-driven IPC between two RT threads to get the work
done. Read the website if you want to understand why; it's not
relevant here.
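
For reference, the RT setup amounts to roughly this (a minimal
sketch - the priority value and function name are illustrative,
not what JACK actually uses):

#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* put the calling thread into SCHED_FIFO and lock all current and
   future pages into RAM - the standard "soft RT" recipe */
static int become_soft_rt(int priority)
{
        struct sched_param param;

        memset(&param, 0, sizeof(param));
        param.sched_priority = priority;

        if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
                perror("sched_setscheduler");
                return -1;
        }

        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
                perror("mlockall");
                return -1;
        }

        return 0;
}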

It's clear from a week of instrumented kernels, and hours poring over
trace files, that the presence of "activity" on the machine delays
the handling of interrupts from the audio interface by up to
a millisecond. Given that the interrupt frequency is about 1000Hz,
this is not good :)

I know from trials of older kernels and systems, in which all the
audio handling resided in the same thread, that we can handle a
1.3msec interrupt interval without real problems. But now that JACK
splits the handling across two threads in different processes, the
thing that kills us is not the context switch times, nor the delay
caused by cache and/or TLB invalidation, or any of that stuff.
Instead, it's that we start delaying the execution of the audio
interface interrupt handler to the point where our assumptions about
handling every interrupt on time fall apart.

The extra work doesn't involve any increased disk activity - it
consists only of more writing to memory (which is all mlocked),
writing to pipes for IPC, and extra context switches.
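
(The pipe IPC is just the usual single-byte wakeup pattern between
the two threads - sketched below with invented names and no error
handling:)

#include <unistd.h>

static int wakeup_fds[2];  /* created with pipe(wakeup_fds) at startup */

/* producer side: wake the peer thread with a one-byte write */
static void signal_peer(void)
{
        char c = 0;
        write(wakeup_fds[1], &c, 1);
}

/* consumer side: block in read() until the peer signals us */
static void wait_for_work(void)
{
        char c;
        read(wakeup_fds[0], &c, 1);
}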

Which of these is most likely to cause us to mask interrupts for up to
a millisecond or more? I know it's not the handler for the interrupt
in question - it never takes more than 25 usecs to execute.

--p


2002-04-07 01:16:30

by Andrew Morton

Subject: Re: delayed interrupt processing caused by cswitching/pipe_writes?

Paul Davis wrote:
>

Hi, Paul.

> ...
> But now that JACK splits the
> handling across two threads in different processes, the thing that
> kills us is not the context switch times, not the delay caused by
> cache and/or TLB invalidation, or any of that stuff. instead, its that
> we start delaying the execution of the audio interface interrupt
> handler to the point where our assumptions about handling every
> interrupt on time fall apart.

Conceivably, something somewhere is forgetting to re-enable
interrupts, and we're not noticing because the scheduler and/or the
return-from-syscall path turns them back on unconditionally.
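
i.e. some path which does the equivalent of this (hypothetical
names, purely to illustrate the failure mode):

#include <linux/errno.h>
#include <asm/system.h>   /* local_irq_save/local_irq_restore */

static int frobnicate(void)
{
        unsigned long flags;

        local_irq_save(flags);
        if (some_error_condition())
                return -EIO;    /* BUG: returns with interrupts still
                                   disabled; nothing complains because
                                   the scheduler or the syscall return
                                   path turns them back on later */
        do_the_work();
        local_irq_restore(flags);
        return 0;
}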

There are a couple of things you can try.

The first is just to use the kernel profiler. Do it on a
uniprocessor to make life simpler. If a particular function
shows up prominently, then perhaps that's the place
where interrupts are getting turned back on, and that may
point us at the source of the problem.

Note that you'll probably get better results with a higher
profiling frequency - build the UP kernel with local APIC
support and use `readprofile -M10' to profile at 1000 Hz.
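
That is, with the kernel booted with a `profile=' argument so that
/proc/profile exists, something like this (the System.map path is
whatever matches your running kernel):

readprofile -r                       # reset the profiling buffer
readprofile -M10                     # 10x the base 100 Hz rate -> 1000 Hz
<run the workload>
readprofile -m /boot/System.map | sort -nr | head -20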

The other approach would be to apply

http://www.zip.com.au/~akpm/linux/timepeg-2.4.19-pre6-1.patch.gz

and enable "Timepeg instrumentation" and "Interrupt latency
instrumentation" in the Kernel Hacking menu.

Compile up http://www.zip.com.au/~akpm/linux/tpt.tar.gz
and run

sudo ./tpt -s | sort -nr +5 | head -20

then run your workload, then run the above command a second time.
I get:

slab.c:1323 -> slab.c:1347            572    .66  55.91   5.90   3375.52
do_IRQ.in:0 -> softirq.c:84          1628  10.72  22.18  11.22  18270.64
exit.c:396 -> exit.c:430                3   5.50  10.66   7.66     22.98
ll_rw_blk.c:886 -> ll_rw_blk.c:996     54   1.25   8.72   2.15    116.28

which shows that the worst-case interrupt-off times
were between lines 1347 and 1323 of slab.c.

Beware that the timepeg patch increases your kernel
size tremendously - it's probably OK with a uniprocessor
build, but even though I've dropped NR_CPUS to 2, a
large SMP kernel build can simply fail to load.

-