2005-10-10 20:17:00

by Mark Knecht

[permalink] [raw]
Subject: Latency data - 2.6.14-rc3-rt13

Hi,
I'm still suffering the occasional xrun using Jack so I'm trying to
understand what information I can feed back on this. I turned on the
histogram feature for both IRQ-off and preempt-off. From what I can
see so far it seems that something is turning IRQs off for long
periods of time? I think that this is similar to what Lee Revell
showed the other day, but I don't know how to get the same sort of
printout he showed.

This is the Jack data:

nperiods = 2 for capture
nperiods = 2 for playback
08:16:26.401 Server configuration saved to "/home/mark/.jackdrc".
08:16:26.401 Statistics reset.
08:16:26.637 Client activated.
08:16:26.639 Audio connection change.
08:16:26.642 Audio connection graph change.
08:16:28.395 MIDI connection graph change.
08:16:28.472 MIDI connection change.
08:16:29.664 Audio connection graph change.
08:19:50.071 Audio connection graph change.
08:19:50.263 MIDI connection graph change.
08:19:50.273 MIDI connection graph change.
10:04:36.179 Audio connection graph change.
10:04:36.362 Audio connection change.
10:04:38.800 Audio connection graph change.
10:04:38.810 Audio connection graph change.
12:34:09.601 XRUN callback (1).
**** alsa_pcm: xrun of at least 0.133 msecs
13:09:41.344 MIDI connection graph change.
13:09:41.439 MIDI connection change.
13:09:42.427 Audio connection graph change.
13:09:45.919 Audio connection graph change.
13:09:45.941 Audio connection graph change.
13:09:45.960 MIDI connection graph change.

This is what I see for preempt off latency. This looks really good to me:

#Minimum latency: 0 microseconds.
#Average latency: 0 microseconds.
#Maximum latency: 1207040 microseconds.
#Total samples: 196871313
#There are 2 samples greater or equal than 10240 microseconds
#usecs samples
0 190504250
1 5881275
2 379533
3 69966
4 13858
5 5708
6 2287
7 7726
8 1771
9 465
10 649
11 935
12 948
13 720
14 573
15 329
16 139
17 71
18 46
19 27
20 21
21 7
22 3
23 2
24 1
25 0
26 0
27 0


However, for IRQ off latency I have far larger numbers. I Assume that
it's probably pretty bad that interrupts could be turned off for as
long as 4mS when I'm trying to run sub3mS?

#Minimum latency: 0 microseconds.
#Average latency: 4 microseconds.
#Maximum latency: 3998 microseconds.
#Total samples: 3102665680
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 3066905198
1 3817807
2 15630448
3 347330
4 55362
5 40098
6 27080
7 27740
8 22484
9 20298
10 15727
11 15346
12 14870
13 15258
14 101559
15 16152
<SNIP>
1365 18434
1366 15529
1367 12471
1368 11392
1369 13927
1370 21813
1371 39568
1372 87654
1373 250408
1374 535990
1375 577197
1376 370609
1377 120045
1378 20431
1379 5312
1380 5349
1381 8203
1382 14773
1383 15352
1384 9515
1385 5020
1386 3618
1387 3738
1388 5554
1389 8669
1390 8244
1391 6448
1392 4948
1393 4151
1394 5984
1395 10599
1396 19058
1397 41222
1398 109732
1399 360362
1400 354579
1401 113700
1402 11014
1403 1240
1404 754
1405 772
1406 963
1407 850
1408 445
1409 276
<SNIP>
3976 62
3977 57
3978 71
3979 70
3980 59
3981 61
3982 69
3983 76
3984 75
3985 72
3986 76
3987 80
3988 65
3989 64
3990 78
3991 70
3992 85
3993 128
3994 236
3995 358
3996 96
3997 0
3998 1
3999 0
4000 0
4001 0
4002 0


How can I get data that would be more useful in terms of real debug?

Thanks,
Mark


2005-10-10 20:49:22

by Daniel Walker

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Mon, 2005-10-10 at 13:16 -0700, Mark Knecht wrote:
> Hi,

>
> How can I get data that would be more useful in terms of real debug?

The IRQ off times look like the worst . If you do "make menuconfig"

then goto Kernel Hacking and select
"Interrupts-off critical section latency timing"
Then select "Latency tracing"

Then when you boot the system before run the following,

"echo 0 > /proc/sys/kernel/preempt_max_latency"


That will record a trace for every maximum latency observed for IRQ
latency . You can view the trace with this command
"cat /proc/latency_trace" , and you can attach the trace to an email to
LKML so we can review it (compress it if it's big though) .

Daniel

2005-10-10 21:12:20

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Daniel Walker <[email protected]> wrote:
> On Mon, 2005-10-10 at 13:16 -0700, Mark Knecht wrote:
> > Hi,
>
> >
> > How can I get data that would be more useful in terms of real debug?
>
> The IRQ off times look like the worst . If you do "make menuconfig"
>
> then goto Kernel Hacking and select
> "Interrupts-off critical section latency timing"
> Then select "Latency tracing"
>
> Then when you boot the system before run the following,
>
> "echo 0 > /proc/sys/kernel/preempt_max_latency"

So this disables the tracing of preempt times but keeps IRQ times on? Cool.
>
>
> That will record a trace for every maximum latency observed for IRQ
> latency . You can view the trace with this command
> "cat /proc/latency_trace" , and you can attach the trace to an email to
> LKML so we can review it (compress it if it's big though) .
>
> Daniel

Will do. Building now. I'll be back later.

Is there anything specific I should look for in the traces myself?
Anyway to help narrow it down?

Cheers,
Mark

2005-10-10 21:28:18

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Mark Knecht <[email protected]> wrote:
<SNIP>
> > then goto Kernel Hacking and select
> > "Interrupts-off critical section latency timing"
> > Then select "Latency tracing"

Only had to turn on Latency Tracing. The others I had on...

<SNIP>
> >
> > Daniel
>
> Will do. Building now. I'll be back later.
>

Unfortunately I didn't think I'd be back this fast. I built the new
kernel and rebooted. The boot starts, gets down to the point where it
tells me that the preempt debug stuff is on, and then jumps to an
endlessly repeating error message:

init[1]: segfault at ffffffff8010fce0 rip ffffffff8010fce0 rsp
00007fffffcb09b8 error 15

This error repeasts endlessly until I reboot.

Good thing I had another kernel I could boot back into! ;-)

So, something isn't happy. Is this a -rt thing or a kernel issue?

Cheers,
Mark

2005-10-10 21:39:36

by Daniel Walker

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Mon, 2005-10-10 at 14:12 -0700, Mark Knecht wrote:
> On 10/10/05, Daniel Walker <[email protected]> wrote:
> > On Mon, 2005-10-10 at 13:16 -0700, Mark Knecht wrote:
> > > Hi,
> >
> > >
> > > How can I get data that would be more useful in terms of real debug?
> >
> > The IRQ off times look like the worst . If you do "make menuconfig"
> >
> > then goto Kernel Hacking and select
> > "Interrupts-off critical section latency timing"
> > Then select "Latency tracing"
> >
> > Then when you boot the system before run the following,
> >
> > "echo 0 > /proc/sys/kernel/preempt_max_latency"
>
> So this disables the tracing of preempt times but keeps IRQ times on? Cool.

Yes, that what the compile options are doing.

> >
> > That will record a trace for every maximum latency observed for IRQ
> > latency . You can view the trace with this command
> > "cat /proc/latency_trace" , and you can attach the trace to an email to
> > LKML so we can review it (compress it if it's big though) .
> >
> > Daniel
>
> Will do. Building now. I'll be back later.
>
> Is there anything specific I should look for in the traces myself?
> Anyway to help narrow it down?

You want the trace to represent the largest latency. For instance in the
histogram it showed IRQ latency was ~3000us or greater, so you want the
trace to show at least that size latency .

Daniel

2005-10-10 21:44:37

by Daniel Walker

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Mon, 2005-10-10 at 14:28 -0700, Mark Knecht wrote:
> On 10/10/05, Mark Knecht <[email protected]> wrote:
> <SNIP>
> > > then goto Kernel Hacking and select
> > > "Interrupts-off critical section latency timing"
> > > Then select "Latency tracing"
>
> Only had to turn on Latency Tracing. The others I had on...
>
> <SNIP>
> > >
> > > Daniel
> >
> > Will do. Building now. I'll be back later.
> >
>
> Unfortunately I didn't think I'd be back this fast. I built the new
> kernel and rebooted. The boot starts, gets down to the point where it
> tells me that the preempt debug stuff is on, and then jumps to an
> endlessly repeating error message:
>
> init[1]: segfault at ffffffff8010fce0 rip ffffffff8010fce0 rsp
> 00007fffffcb09b8 error 15
>
> This error repeasts endlessly until I reboot.
>
> Good thing I had another kernel I could boot back into! ;-)
>
> So, something isn't happy. Is this a -rt thing or a kernel issue?

Hmm, it looks like latency tracing doesn't work on x86_64 .. I guess
you'll have to wait till someone fixes it .

Another option is to turn off "Latency Tracing" then reboot, like it was
before but w/o the histogram. Then run,

"echo 0 > /proc/sys/kernel/preempt_max_latency"

Whenever a new maximum latency is observed it will log it with a stack
trace in the system logs. You can report that back here on LKML .

You can view the system log with the command "dmesg" , I think .

Daniel

2005-10-10 22:09:45

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Daniel Walker <[email protected]> wrote:
> On Mon, 2005-10-10 at 14:28 -0700, Mark Knecht wrote:
> > On 10/10/05, Mark Knecht <[email protected]> wrote:
> > <SNIP>
> > > > then goto Kernel Hacking and select
> > > > "Interrupts-off critical section latency timing"
> > > > Then select "Latency tracing"
> >
> > Only had to turn on Latency Tracing. The others I had on...
> >
> > <SNIP>
> > > >
> > > > Daniel
> > >
> > > Will do. Building now. I'll be back later.
> > >
> >
> > Unfortunately I didn't think I'd be back this fast. I built the new
> > kernel and rebooted. The boot starts, gets down to the point where it
> > tells me that the preempt debug stuff is on, and then jumps to an
> > endlessly repeating error message:
> >
> > init[1]: segfault at ffffffff8010fce0 rip ffffffff8010fce0 rsp
> > 00007fffffcb09b8 error 15
> >
> > This error repeasts endlessly until I reboot.
> >
> > Good thing I had another kernel I could boot back into! ;-)
> >
> > So, something isn't happy. Is this a -rt thing or a kernel issue?
>
> Hmm, it looks like latency tracing doesn't work on x86_64 .. I guess
> you'll have to wait till someone fixes it .
>
> Another option is to turn off "Latency Tracing" then reboot, like it was
> before but w/o the histogram. Then run,
>
> "echo 0 > /proc/sys/kernel/preempt_max_latency"
>
> Whenever a new maximum latency is observed it will log it with a stack
> trace in the system logs. You can report that back here on LKML .
>
> You can view the system log with the command "dmesg" , I think .
>
> Daniel
>
>
Yes, already that looks interesting. Do I have something going on with
acpi? This is before running Jack. I'm in Gnome.There are a lot of
these messages, but they've stopped. I suppose the 3997 is the delay?
That would coorespond with the info I sent earlier.

( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
=> started at timestamp 185717289: <acpi_processor_idle+0x20/0x379>
=> ended at timestamp 185721287: <thread_return+0xb5/0x11a>

Call Trace:<ffffffff801519ac>{check_critical_timing+492}
<ffffffff80151e05>{sub_preempt_count_ti+133}
<ffffffff803f632c>{thread_return+70}
<ffffffff803f639b>{thread_return+181}
<ffffffff803f6505>{schedule+261} <ffffffff8013a84a>{ksoftirqd+138}
<ffffffff8013a7c0>{ksoftirqd+0} <ffffffff8014aa8d>{kthread+205}
<ffffffff8010f716>{child_rip+8} <ffffffff8014a9c0>{kthread+0}
<ffffffff8010f70e>{child_rip+0}
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff803f5df8>] .... __schedule+0xb8/0x5a6
.....[<00000000>] .. ( <= 0x0)

=> dump-end timestamp 185721364

mark@lightning ~ $

I started up Jack, mounted my 1394 drives and started streaming a
little bit of data. Basicaly my ultra-light load test. I see nothing
new in dmesg yet. I'll keep watching.

Thanks,
Mark

2005-10-10 22:28:25

by Daniel Walker

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Mon, 2005-10-10 at 15:09 -0700, Mark Knecht wrote:
> On 10/10/05, Daniel Walker <[email protected]> wrote:
> > On Mon, 2005-10-10 at 14:28 -0700, Mark Knecht wrote:
> > > On 10/10/05, Mark Knecht <[email protected]> wrote:
> > > <SNIP>
> > > > > then goto Kernel Hacking and select
> > > > > "Interrupts-off critical section latency timing"
> > > > > Then select "Latency tracing"
> > >
> > > Only had to turn on Latency Tracing. The others I had on...
> > >
> > > <SNIP>
> > > > >
> > > > > Daniel
> > > >
> > > > Will do. Building now. I'll be back later.
> > > >
> > >
> > > Unfortunately I didn't think I'd be back this fast. I built the new
> > > kernel and rebooted. The boot starts, gets down to the point where it
> > > tells me that the preempt debug stuff is on, and then jumps to an
> > > endlessly repeating error message:
> > >
> > > init[1]: segfault at ffffffff8010fce0 rip ffffffff8010fce0 rsp
> > > 00007fffffcb09b8 error 15
> > >
> > > This error repeasts endlessly until I reboot.
> > >
> > > Good thing I had another kernel I could boot back into! ;-)
> > >
> > > So, something isn't happy. Is this a -rt thing or a kernel issue?
> >
> > Hmm, it looks like latency tracing doesn't work on x86_64 .. I guess
> > you'll have to wait till someone fixes it .
> >
> > Another option is to turn off "Latency Tracing" then reboot, like it was
> > before but w/o the histogram. Then run,
> >
> > "echo 0 > /proc/sys/kernel/preempt_max_latency"
> >
> > Whenever a new maximum latency is observed it will log it with a stack
> > trace in the system logs. You can report that back here on LKML .
> >
> > You can view the system log with the command "dmesg" , I think .
> >
> > Daniel
> >
> >
> Yes, already that looks interesting. Do I have something going on with
> acpi? This is before running Jack. I'm in Gnome.There are a lot of
> these messages, but they've stopped. I suppose the 3997 is the delay?
> That would coorespond with the info I sent earlier.

I think this is a false reading . Sometimes when a system goes idle it
will cause interrupt disable latency that isn't real (due to process
halting) . You could try turning ACPI off if you can , and turn off
power management ..

Daniel

2005-10-10 23:33:17

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Daniel Walker <[email protected]> wrote:

> > Yes, already that looks interesting. Do I have something going on with
> > acpi? This is before running Jack. I'm in Gnome.There are a lot of
> > these messages, but they've stopped. I suppose the 3997 is the delay?
> > That would coorespond with the info I sent earlier.
>
> I think this is a false reading . Sometimes when a system goes idle it
> will cause interrupt disable latency that isn't real (due to process
> halting) . You could try turning ACPI off if you can , and turn off
> power management ..
>
> Daniel

OK. NO immediate difference with most of the power management stuff
turned off. I'll just let it run a few hours and see if I get an xrun.
If I do I'll look at the logs again and report back.

Thanks for the help. I feel like I've got a chance of spotting something.

Cheers,
Mark

2005-10-10 23:49:14

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Mark Knecht <[email protected]> wrote:
> On 10/10/05, Daniel Walker <[email protected]> wrote:
>
> > > Yes, already that looks interesting. Do I have something going on with
> > > acpi? This is before running Jack. I'm in Gnome.There are a lot of
> > > these messages, but they've stopped. I suppose the 3997 is the delay?
> > > That would coorespond with the info I sent earlier.
> >
> > I think this is a false reading . Sometimes when a system goes idle it
> > will cause interrupt disable latency that isn't real (due to process
> > halting) . You could try turning ACPI off if you can , and turn off
> > power management ..
> >
> > Daniel
>
> OK. NO immediate difference with most of the power management stuff
> turned off. I'll just let it run a few hours and see if I get an xrun.
> If I do I'll look at the logs again and report back.
>
> Thanks for the help. I feel like I've got a chance of spotting something.
>
> Cheers,
> Mark

Well, I'm disappointed again. Some xruns came but no additional data
was put into the dmesage file:

( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
=> started at timestamp 253995015: <acpi_processor_idle+0x20/0x379>
=> ended at timestamp 253999013: <thread_return+0xb5/0x11a>

Call Trace:<ffffffff8014e32c>{check_critical_timing+492}
<ffffffff8014e785>{sub_preempt_count_ti+133}
<ffffffff803ec76c>{thread_return+70}
<ffffffff803ec7db>{thread_return+181}
<ffffffff803ec945>{schedule+261} <ffffffff801371ca>{ksoftirqd+138}
<ffffffff80137140>{ksoftirqd+0} <ffffffff8014740d>{kthread+205}
<ffffffff8010e6e6>{child_rip+8} <ffffffff80147340>{kthread+0}
<ffffffff8010e6de>{child_rip+0}
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff803ec238>] .... __schedule+0xb8/0x5a6
.....[<00000000>] .. ( <= 0x0)

=> dump-end timestamp 253999092

kjournald starting. Commit interval 5 seconds
EXT3 FS on sdb1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
lightning ~ #

Basically the trace above was there pretty much right after booting. I
mounted a 1394 drive, which is shown, and then started running apps. I
got 3 xruns:

16:42:59.826 XRUN callback (1).
**** alsa_pcm: xrun of at least 0.273 msecs
16:43:03.764 XRUN callback (2).
**** alsa_pcm: xrun of at least 2.082 msecs
16:43:41.618 XRUN callback (3).
**** alsa_pcm: xrun of at least 2.040 msecs

but no additional data in the logs.

So possibly this is an IRQ off latency? Or am I jumping to
conclusions? The trace above, if I understand it, indicates a delay of
nearly 4mS, and again, since I'm trying to run at sub 3mS, that would
seem like a potential problem.

I'm going to completely shut off power management as a test just to
see where that leads.

- Mark

2005-10-11 03:45:32

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Mark Knecht <[email protected]> wrote:
>
> ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.

So the root cause of this 4mS delay is the 250Hz timer. If I change
the system to use the 1Khz timer then the time in this section drops,
as expected, to 1mS.

( softirq-timer/0-3 |#0): new 998 us maximum-latency critical section.
=> started at timestamp 121040020: <acpi_processor_idle+0x20/0x379>
=> ended at timestamp 121041019: <thread_return+0xb5/0x11a>

So, thinking very interesting here I think.

Back to the drawing board as to my xruns.

- Mark

2005-10-11 03:56:08

by Matan Peled

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Knecht wrote:
> Well, I'm disappointed again. Some xruns came but no additional data
> was put into the dmesage file:

"dmesage file"? You mean, "/var/log/dmesg"? No, thats not the file you want...
At least here, that file contains the logs as they were when the boot finishes.

The information you want can be retrieved by running the command (actual program
to run, type it in a terminal) "dmesg".

Or you might be able to look whats in the end of the /var/log/messages file, if
you have one.

- --
[Name ] :: [Matan I. Peled ]
[Location ] :: [Israel ]
[Public Key] :: [0xD6F42CA5 ]
[Keyserver ] :: [keyserver.kjsl.com]
encrypted/signed plain text preferred

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFDSzgDA7Qvptb0LKURApE4AKCOxBS3Xos5yEAnd5QbItjRmYlIsQCePSmm
ULBWIDcIOPxXAnMSodZNgCg=
=qmrM
-----END PGP SIGNATURE-----

2005-10-11 04:45:47

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/10/05, Matan Peled <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Mark Knecht wrote:
> > Well, I'm disappointed again. Some xruns came but no additional data
> > was put into the dmesage file:
>
> "dmesage file"? You mean, "/var/log/dmesg"? No, thats not the file you want...
> At least here, that file contains the logs as they were when the boot finishes.
>
> The information you want can be retrieved by running the command (actual program
> to run, type it in a terminal) "dmesg".

Right, sorry. Obviously bad typing on my part. I am running 'dmesg' at
the command line to get the traces. My bad...

Anyway, these latencies seem understood at this point. They are just
the timer that you set in the kernel hacking section. No big deal.
However, none of the messages yet give any clues (that I understand)
as to the cause of the timing misses I'm seeing with 2.6.14-rc3-rt13.
I shall look into a 2.6.14-rc4-rtX tomorrow.

Cheers,
Mark

2005-10-11 06:13:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


* Mark Knecht <[email protected]> wrote:

> Anyway, these latencies seem understood at this point. They are just
> the timer that you set in the kernel hacking section. No big deal.
> However, none of the messages yet give any clues (that I understand)
> as to the cause of the timing misses I'm seeing with 2.6.14-rc3-rt13.
> I shall look into a 2.6.14-rc4-rtX tomorrow.

do you have 64-bit userspace too? If you have 32-bit userspace then
could you try running the x86 kernel? Generally the 64-bit kernel has
less mature debugging options in the -rt tree: e.g. latency tracing
itself doesnt work (due to gcc silliness).

Ingo

2005-10-11 06:50:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


On Mon, 10 Oct 2005, Mark Knecht wrote:

> On 10/10/05, Mark Knecht <[email protected]> wrote:
> >
> > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
>
> So the root cause of this 4mS delay is the 250Hz timer. If I change
> the system to use the 1Khz timer then the time in this section drops,
> as expected, to 1mS.
>
> ( softirq-timer/0-3 |#0): new 998 us maximum-latency critical section.
> => started at timestamp 121040020: <acpi_processor_idle+0x20/0x379>
> => ended at timestamp 121041019: <thread_return+0xb5/0x11a>
>
> So, thinking very interesting here I think.
>
> Back to the drawing board as to my xruns.
>

Are your xruns showing the same type of latency? If you switch to the
1000Hz do you get only 1mS latency on your xruns too? This sounds like
the application might have gone to sleep and didn't wake up until
something scheduled it in. Or something else with the scheduler. I doubt
that this has to due with preemption or interrupts being off, but
something that uses jiffies to calculate.

-- Steve

2005-10-11 07:52:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


* Ingo Molnar <[email protected]> wrote:

> > Anyway, these latencies seem understood at this point. They are just
> > the timer that you set in the kernel hacking section. No big deal.
> > However, none of the messages yet give any clues (that I understand)
> > as to the cause of the timing misses I'm seeing with 2.6.14-rc3-rt13.
> > I shall look into a 2.6.14-rc4-rtX tomorrow.
>
> do you have 64-bit userspace too? If you have 32-bit userspace then
> could you try running the x86 kernel? Generally the 64-bit kernel has
> less mature debugging options in the -rt tree: e.g. latency tracing
> itself doesnt work [...]

let me take that back - latency tracing does work on x64 too.

Ingo

2005-10-11 08:56:22

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Mon, 2005-10-10 at 20:45 -0700, Mark Knecht wrote:
> On 10/10/05, Mark Knecht <[email protected]> wrote:
> >
> > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
>
> So the root cause of this 4mS delay is the 250Hz timer. If I change
> the system to use the 1Khz timer then the time in this section drops,
> as expected, to 1mS.
>
> ( softirq-timer/0-3 |#0): new 998 us maximum-latency critical section.
> => started at timestamp 121040020: <acpi_processor_idle+0x20/0x379>
> => ended at timestamp 121041019: <thread_return+0xb5/0x11a>
>
> So, thinking very interesting here I think.
>

Maximum preemption latencies near the timer period
are sometimes indicative of mismatched
preempt_disable/preempt_enable pairs in the code.

Usually the scheduler warns about that on the console, however.

Sven

> Back to the drawing board as to my xruns.
>
> - Mark
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
***********************************
Sven-Thorsten Dietrich
Real-Time Software Architect
MontaVista Software, Inc.
1237 East Arques Ave.
Sunnyvale, CA 94085

Phone: 408.992.4515
Fax: 408.328.9204

http://www.mvista.com
Platform To Innovate
***********************************

2005-10-11 11:16:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


* Mark Knecht <[email protected]> wrote:

> > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
>
> So the root cause of this 4mS delay is the 250Hz timer. If I change
> the system to use the 1Khz timer then the time in this section drops,
> as expected, to 1mS.

this was a bug in the critical-section-latency measurement code of x64.
The timer irq is the one that leaves userspace running for the longest
time, between two kernel calls.

I have fixed these bugs in -rc4-rt1, could you try it? It should report
much lower latencies, regardless of PM settings.

Ingo

2005-10-11 22:45:33

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/11/05, Ingo Molnar <[email protected]> wrote:
>
> * Mark Knecht <[email protected]> wrote:
>
> > > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
> >
> > So the root cause of this 4mS delay is the 250Hz timer. If I change
> > the system to use the 1Khz timer then the time in this section drops,
> > as expected, to 1mS.
>
> this was a bug in the critical-section-latency measurement code of x64.
> The timer irq is the one that leaves userspace running for the longest
> time, between two kernel calls.
>
> I have fixed these bugs in -rc4-rt1, could you try it? It should report
> much lower latencies, regardless of PM settings.
>
> Ingo
>

Ingo,
This test now reports much more intersting data:

( dmesg-8010 |#0): new 13 us maximum-latency critical section.
=> started at timestamp 117628604: <do_IRQ+0x29/0x50>
=> ended at timestamp 117628618: <do_IRQ+0x39/0x50>

Call Trace: <IRQ> <ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75} <ffffffff80110159>{do_IRQ+57}
<ffffffff8010e16c>{ret_from_intr+0} <EOI>
<ffffffff80249a75>{copy_page+5}
<ffffffff801671f1>{do_no_page+737}
<ffffffff801674ee>{__handle_mm_fault+414}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff803eac99>{do_page_fault+1049}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff80168eee>{vma_link+286}
<ffffffff8010e4c1>{error_exit+0} <ffffffff8024a866>{__clear_user+22}
<ffffffff801a46fb>{padzero+27} <ffffffff801a4b42>{load_elf_interp+850}
<ffffffff801a5b1d>{load_elf_binary+3341}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff801a4e10>{load_elf_binary+0}
<ffffffff80182ee0>{search_binary_handler+272}
<ffffffff801831f5>{do_execve+405}
<ffffffff8010dbc6>{system_call+126} <ffffffff8010c634>{sys_execve+68}
<ffffffff8010dfea>{stub_execve+106}
=> dump-end timestamp 117628777

( nautilus-7955 |#0): new 14 us maximum-latency critical section.
=> started at timestamp 127874927: <do_IRQ+0x29/0x50>
=> ended at timestamp 127874941: <do_IRQ+0x39/0x50>

Call Trace: <IRQ> <ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75} <ffffffff80110159>{do_IRQ+57}
<ffffffff8010e16c>{ret_from_intr+0} <EOI>
<ffffffff8010dbc6>{system_call+126}

=> dump-end timestamp 127874996

( nautilus-7955 |#0): new 14 us maximum-latency critical section.
=> started at timestamp 128647499: <do_IRQ+0x29/0x50>
=> ended at timestamp 128647514: <do_IRQ+0x39/0x50>

Call Trace: <IRQ> <ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75} <ffffffff80110159>{do_IRQ+57}
<ffffffff8010e16c>{ret_from_intr+0} <EOI>
=> dump-end timestamp 128647567

lightning ~ #

NOTE: In my first kernel build I turned on latency measurement for
both max preempt and ITQ-off. The kernel segfaulted immediately after
the boot messages reminding me that these were on. I rebuilt the
kernel with only the max preempt and that worked. I suspect some
problem with the IRQ-off section since that's the only change I made
and it did not segfault in -rc3-rt13.

Thanks,
Mark

2005-10-12 00:02:51

by Lee Revell

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Tue, 2005-10-11 at 15:45 -0700, Mark Knecht wrote:
> On 10/11/05, Ingo Molnar <[email protected]> wrote:
> >
> > * Mark Knecht <[email protected]> wrote:
> >
> > > > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
> > >
> > > So the root cause of this 4mS delay is the 250Hz timer. If I change
> > > the system to use the 1Khz timer then the time in this section drops,
> > > as expected, to 1mS.
> >
> > this was a bug in the critical-section-latency measurement code of x64.
> > The timer irq is the one that leaves userspace running for the longest
> > time, between two kernel calls.
> >
> > I have fixed these bugs in -rc4-rt1, could you try it? It should report
> > much lower latencies, regardless of PM settings.
> >
> > Ingo
> >
>
> Ingo,
> This test now reports much more intersting data:
>
> ( dmesg-8010 |#0): new 13 us maximum-latency critical section.
> => started at timestamp 117628604: <do_IRQ+0x29/0x50>
> => ended at timestamp 117628618: <do_IRQ+0x39/0x50>

This is the expected, correct behavior - very small maximum latency
critical sections. Do you get anything longer (say 300 usecs or more)
if you leave it running?

So far the latency tracer on my much slower system has only gone up to
123 usecs. So the bug seems to be fixed at least on i386.

Lee

2005-10-12 01:09:20

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/11/05, Lee Revell <[email protected]> wrote:
> On Tue, 2005-10-11 at 15:45 -0700, Mark Knecht wrote:
> > On 10/11/05, Ingo Molnar <[email protected]> wrote:
> > >
> > > * Mark Knecht <[email protected]> wrote:
> > >
> > > > > ( softirq-timer/0-3 |#0): new 3997 us maximum-latency critical section.
> > > >
> > > > So the root cause of this 4mS delay is the 250Hz timer. If I change
> > > > the system to use the 1Khz timer then the time in this section drops,
> > > > as expected, to 1mS.
> > >
> > > this was a bug in the critical-section-latency measurement code of x64.
> > > The timer irq is the one that leaves userspace running for the longest
> > > time, between two kernel calls.
> > >
> > > I have fixed these bugs in -rc4-rt1, could you try it? It should report
> > > much lower latencies, regardless of PM settings.
> > >
> > > Ingo
> > >
> >
> > Ingo,
> > This test now reports much more intersting data:
> >
> > ( dmesg-8010 |#0): new 13 us maximum-latency critical section.
> > => started at timestamp 117628604: <do_IRQ+0x29/0x50>
> > => ended at timestamp 117628618: <do_IRQ+0x39/0x50>
>
> This is the expected, correct behavior - very small maximum latency
> critical sections. Do you get anything longer (say 300 usecs or more)
> if you leave it running?
>
> So far the latency tracer on my much slower system has only gone up to
> 123 usecs. So the bug seems to be fixed at least on i386.
>
> Lee
>
>

I've been watching this now for a couple of hours. Still no xruns. Max
latency has only gone up to 19uS, so this is all good. However the
thing I am seeing is that all my free memory has gone away and I've
now swapped out just a little nit. There was really no change in the
state of the machine, other than xscreensaver starting and running for
a while, which is standard for me. None the less here's the sort of
state things are in - 17uS right after booting, and only 19uS now.
That's good. But in the middle look at the change in memory statistics
during a one hour time period.

( X-7860 |#0): new 16 us maximum-latency critical section.
=> started at timestamp 259455017: <__schedule+0xb8/0x596>
=> ended at timestamp 259455034: <thread_return+0xb4/0x10a>

Call Trace:<ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff803e7d9a>{thread_return+180} <ffffffff803e7ef5>{schedule+261}
<ffffffff803e89e4>{schedule_timeout+148}
<ffffffff8013a660>{process_timeout+0}
<ffffffff8018b767>{do_select+967} <ffffffff8018b2b0>{__pollwait+0}
<ffffffff8018baad>{sys_select+749}
<ffffffff8010d9e9>{sys_rt_sigreturn+553}
<ffffffff8010dbc6>{system_call+126}
=> dump-end timestamp 259455125

( X-7860 |#0): new 17 us maximum-latency critical section.
=> started at timestamp 375451745: <__schedule+0xb8/0x596>
=> ended at timestamp 375451762: <thread_return+0xb4/0x10a>

Call Trace:<ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff803e7d9a>{thread_return+180} <ffffffff803e7ef5>{schedule+261}
<ffffffff803e89e4>{schedule_timeout+148}
<ffffffff8013a660>{process_timeout+0}
<ffffffff8018b767>{do_select+967} <ffffffff8018b2b0>{__pollwait+0}
<ffffffff8018baad>{sys_select+749}
<ffffffff8010d9e9>{sys_rt_sigreturn+553}
<ffffffff8010dbc6>{system_call+126}
=> dump-end timestamp 375451852

kjournald starting. Commit interval 5 seconds
EXT3 FS on sdc1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
lightning ~ #



lightning ~ # date && free
Tue Oct 11 16:21:41 PDT 2005
total used free shared buffers cached
Mem: 510460 410816 99644 0 15896 169664
-/+ buffers/cache: 225256 285204
Swap: 996020 0 996020
lightning ~ # date && free
Tue Oct 11 17:39:03 PDT 2005
total used free shared buffers cached
Mem: 510460 504464 5996 0 39828 119012
-/+ buffers/cache: 345624 164836
Swap: 996020 304 995716
lightning ~ #



( hdspmixer-8071 |#0): new 18 us maximum-latency critical section.
=> started at timestamp 2132546112: <snd_ctl_open+0x71/0x200 [snd]>
=> ended at timestamp 2132546131: <snd_ctl_open+0x8a/0x200 [snd]>

Call Trace:<ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff88006a0a>{:snd:snd_ctl_open+138}
<ffffffff8800414b>{:snd:snd_open+267}
<ffffffff801806d9>{chrdev_open+441} <ffffffff801766a8>{__dentry_open+280}
<ffffffff80176877>{filp_open+135}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff80176a06>{get_unused_fd+230} <ffffffff80176b71>{do_sys_open+81}
<ffffffff8010dbc6>{system_call+126}
=> dump-end timestamp 2132546239

( IRQ 58-4526 |#0): new 19 us maximum-latency critical section.
=> started at timestamp 8439787551: <__schedule+0xb8/0x596>
=> ended at timestamp 8439787571: <thread_return+0xb4/0x10a>

Call Trace:<ffffffff8014c42c>{check_critical_timing+492}
<ffffffff8014c64b>{sub_preempt_count_ti+75}
<ffffffff803e7d9a>{thread_return+180}
<ffffffff80146550>{keventd_create_kthread+0}
<ffffffff803e7ef5>{schedule+261} <ffffffff80156062>{do_irqd+514}
<ffffffff80155e60>{do_irqd+0} <ffffffff8014650d>{kthread+205}
<ffffffff8010e676>{child_rip+8}
<ffffffff80146550>{keventd_create_kthread+0}
<ffffffff80354d00>{pci_conf1_read+0} <ffffffff80146440>{kthread+0}
<ffffffff8010e66e>{child_rip+0}
=> dump-end timestamp 8439787658

lightning ~ #


Now, I know these command line methods are probably frowned upon by
folks in the know so I'm happy to do this in a more rigorous way. (Is
it called valgrind?) Should free memory drop like that over time?

The only apps running were Aqualung, hdspmixer and Firefox. Is one of
them leaking, is the kernel leaking, or is this normal?

I will continue to let this run for another 4-6 hours today and hope I
catch another xrun. If not I'll start it up again tomorrow and
certainly will get something.

Thanks much,
Mark

2005-10-12 01:21:06

by Lee Revell

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Tue, 2005-10-11 at 18:09 -0700, Mark Knecht wrote:
> Should free memory drop like that over time?

Yes this is perfectly normal. When a system first boots all the memory
your apps aren't using is initially free. As applications access more
data over time then it will be cached in memory until free memory drops
to near zero.

"Free memory" is actually wasted memory - it's better to use all
available RAM for caching.

Lee

2005-10-12 01:25:14

by Mark Knecht

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On 10/11/05, Lee Revell <[email protected]> wrote:
> On Tue, 2005-10-11 at 18:09 -0700, Mark Knecht wrote:
> > Should free memory drop like that over time?
>
> Yes this is perfectly normal. When a system first boots all the memory
> your apps aren't using is initially free. As applications access more
> data over time then it will be cached in memory until free memory drops
> to near zero.
>
> "Free memory" is actually wasted memory - it's better to use all
> available RAM for caching.
>
> Lee

OK, non-intuitive for a guitar player, but good to know it's normal. Thanks!

In that case I'll just wait for an xrun and hope I catch something in
dmesg. We'll just have to wait and see.

Cheers,
Mark

2005-10-12 06:45:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


On Tue, 11 Oct 2005, Lee Revell wrote:

> On Tue, 2005-10-11 at 18:09 -0700, Mark Knecht wrote:
> > Should free memory drop like that over time?
>
> Yes this is perfectly normal. When a system first boots all the memory
> your apps aren't using is initially free. As applications access more
> data over time then it will be cached in memory until free memory drops
> to near zero.
>
> "Free memory" is actually wasted memory - it's better to use all
> available RAM for caching.
>

But the swap being touched bothers me. Although I've had problems with
leaving Mozilla up for long times and it leaking. Without Mozilla running
and running lots of other apps, I have almost 100% memory used, but 0%
swap.

If the swap starts to increase slowly over time, you _do_ have a
leak somewhere. Probably not in the kernel (kernel memory never goes into
swap). But if you want to see if the kernel is leaking, examine
/proc/slabinfo once in a while and if you see something there constantly
growing, then that might indicate a leak. Just pay attention to the first
column.

-- Steve

2005-10-12 16:18:30

by Lee Revell

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13

On Wed, 2005-10-12 at 02:38 -0400, Steven Rostedt wrote:
> On Tue, 11 Oct 2005, Lee Revell wrote:
>
> > On Tue, 2005-10-11 at 18:09 -0700, Mark Knecht wrote:
> > > Should free memory drop like that over time?
> >
> > Yes this is perfectly normal. When a system first boots all the memory
> > your apps aren't using is initially free. As applications access more
> > data over time then it will be cached in memory until free memory drops
> > to near zero.
> >
> > "Free memory" is actually wasted memory - it's better to use all
> > available RAM for caching.
> >
>
> But the swap being touched bothers me. Although I've had problems with
> leaving Mozilla up for long times and it leaking. Without Mozilla running
> and running lots of other apps, I have almost 100% memory used, but 0%
> swap.

I believe this is the expected behavior under 2.6 unless you
set /proc/sys/vm/swappiness to 0. If an app allocates memory and then
never touches it then those pages will eventually be swapped out to make
room for hot ones.

Lee

2005-10-12 17:04:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: Latency data - 2.6.14-rc3-rt13


On Wed, 12 Oct 2005, Lee Revell wrote:

>
> I believe this is the expected behavior under 2.6 unless you
> set /proc/sys/vm/swappiness to 0. If an app allocates memory and then
> never touches it then those pages will eventually be swapped out to make
> room for hot ones.
>

OK, thanks for the info. I guess my apps don't allocate enough memory to
be eventually swapped out (I obviously run 2.6). Or those apps use the
memory that it allocates often.

Whatever..

-- Steve