Hi Arjan, Ingo,
I am using the latest latencytop from git (commit
92b6ca9d40998697866689f64b95647eca3200cb), and I'm seeing some strange
things:
[for minor latencytop userspace-specific issues, see the end of this mail]
* if I compile with CONFIG_SCHED_DEBUG, I always see a 30-40 msec latency
from the scheduler (Scheduler: waiting for cpu).
There is also a constant ~5% user, ~2% sys CPU usage on an idle system,
regardless of whether the latencytop sysctl is on or off.
Is this normal? (is the overhead really 40 msec?)
I was also seeing an unusually high number of context switches (as shown
by vmstat). I usually have 400-800 with non-patched kernels (while
running mplayer too), but I was getting steadily over 1100 with the
patch (on an idle system).
CPU usage with latencytop on:
Cpu(s):  7.3%us,  7.0%sy,  0.3%ni, 85.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
CPU usage with latencytop sysctl off:
Cpu(s):  5.0%us,  1.7%sy,  0.0%ni, 93.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
However, top won't tell me who's using that 5% CPU!
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 10388 648 548 S 0.0 0.0 0:01.70 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 15 -5 0 0 0 S 0.0 0.0 0:00.13 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
High number of context switches (>1100):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0    520  60828  92276 832212    0    0   281    55  330  1215 16  3 73  8
 0  0    520  60828  92276 832212    0    0     0     2  242  1158  5  1 94  0
 0  0    520  60828  92288 832212    0    0     0     3  290  1261  8  1 91  0
 0  0    520  60836  92288 832216    0    0     1     0  227  1407 10  2 89  0
Scheduler latency:
Cause                                                Maximum     Percentage
put_device elv_insert blk_plug_device default_wake  152.4 msec    4.3 %
Creating block layer request                         68.1 msec    2.5 %
Scheduler: waiting for cpu                           26.6 msec    6.5 %
Writing a page to disk                               15.5 msec    1.8 %
Checking for media change                            10.3 msec    7.8 %
SCSI cdrom ioctl                                      9.4 msec    1.3 %
Checking CDROM media present                          7.1 msec    0.3 %
Waiting for event (poll)                              5.0 msec   59.4 %
Application requested delay                           4.9 msec   11.6 %

Process hddtemp (3772)
put_device elv_insert blk_plug_device default_wake  152.4 msec   61.5 %
  blk_rq_bio_prep blk_rq_append_bio blk_rq_map_user sg_io scsi_
Creating block layer request                         68.1 msec   36.3 %
Scheduler: waiting for cpu                            6.5 msec    2.2 %
* if I compile without CONFIG_SCHED_DEBUG, I no longer get *any* latency
from the scheduler, even if I run multi-threaded programs, etc. Is this
to be expected? (i.e. is this feature only available when
CONFIG_SCHED_DEBUG is enabled?)
I was trying to trigger some scheduler latencies artificially, so I wrote
a program that creates 10 threads, each doing a usleep(1) in an infinite
loop. Still I get no scheduler latency (I was getting scheduler latency
with version 0.1).
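For reference, here is a minimal sketch of the test program (reconstructed
from the description above; the exact source wasn't posted, so the names
are made up; build with "gcc -pthread testload.c"):

#include <pthread.h>
#include <unistd.h>

/* each of the 10 threads calls usleep(1) in an infinite loop */
static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		usleep(1);
	return NULL;
}

int main(void)
{
	pthread_t threads[10];
	int i;

	for (i = 0; i < 10; i++)
		pthread_create(&threads[i], NULL, spin, NULL);
	pthread_join(threads[0], NULL);	/* never returns */
	return 0;
}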
If the latencytop sysctl is on, then after running this test for a short
while X becomes unusable and almost freezes; yet I still can't see any
latencies. Before it froze I captured top output (see below); notice how
otherwise idle programs get a high CPU usage!
If the latencytop sysctl is off, no freeze occurs and CPU usage is normal
(a.out getting 99%).
Top output when the latencytop sysctl is on:
top - 19:33:31 up 1:28, 2 users, load average: 3.45, 1.21, 1.05
Tasks: 120 total, 9 running, 110 sleeping, 1 stopped, 0 zombie
Cpu(s):  3.9%us, 95.5%sy,  0.6%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem: 2060592k total, 1979756k used, 80836k free, 69028k buffers
Swap: 3903724k total, 0k used, 3903724k free, 1473896k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7008 edwin 20 0 38764 472 356 R 58.4 0.0 0:05.94 a.out
5613 edwin 20 0 179m 10m 7452 R 17.3 0.5 1:45.19 gkrellm
5920 edwin 20 0 565m 90m 25m R 17.3 4.5 1:41.82 icedove-bin
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is an idle program!
5610 edwin 20 0 152m 21m 13m R 1.9 1.1 1:28.88 konsole
6749 edwin 20 0 254m 7800 5496 S 1.9 0.4 0:00.87 mplayer
5209 root 20 0 110m 35m 9016 S 1.3 1.8 1:36.31 Xorg
908 root 15 -5 0 0 0 S 0.6 0.0 0:01.60 scsi_eh_0
5132 postgres 20 0 77216 1284 736 S 0.6 0.1 0:00.35 postgres
5151 root 35 15 41068 29m 648 S 0.6 1.5 0:27.96 preload
7005 edwin 20 0 18976 1184 880 R 0.6 0.1 0:00.02 top
1 root 20 0 10388 712 588 S 0.0 0.0 0:01.68 init
With the latencytop sysctl off:
top - 19:37:03 up 1:32, 2 users, load average: 0.76, 0.95, 0.98
Tasks: 120 total, 6 running, 113 sleeping, 1 stopped, 0 zombie
Cpu(s): 10.6%us, 89.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem: 2060592k total, 1984368k used, 76224k free, 69476k buffers
Swap: 3903724k total, 0k used, 3903724k free, 1474340k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7107 edwin 20 0 87940 504 368 S 94.1 0.0 0:05.52 a.out
5610 edwin 20 0 152m 21m 13m R 3.0 1.1 1:31.26 konsole
5209 root 20 0 110m 35m 9016 S 1.5 1.8 1:41.19 Xorg
5613 edwin 20 0 179m 10m 7452 S 1.5 0.5 1:48.59 gkrellm
6749 edwin 20 0 254m 7808 5496 S 1.5 0.4 0:02.31 mplayer
7111 edwin 20 0 18976 1180 880 R 1.5 0.1 0:00.01 top
1 root 20 0 10388 712 588 S 0.0 0.0 0:01.68 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Some minor latencytop userspace issues:
* percentages: a new feature (nice!), but the values seem all wrong
(global percentages are always wrong; per-app percentages are usually OK,
but see the second example below)
* I miss the Average latency column. If it is too costly to keep track
of an overall average, can we have a last-N-seconds average?
Or just the cumulative latency so far (unless it overflows).
This is especially useful for lock contention latencies: if I see a
maximum latency of a few milliseconds I don't worry; if I see an average
latency of a few milliseconds I start to worry.
* unknown reasons show a backtrace, but the backtrace doesn't have
enough room on screen
* on startup there are lots of latencies with process_timeout first in
the backtrace
* /proc/latency_stats still says it is v0.1
Example output to illustrate some of the issues:
LatencyTOP version 0.3 (C) 2008 Intel Corporation
Cause                                                Maximum     Percentage
put_device elv_insert blk_plug_device default_wake   29.3 msec    0.2 %
Creating block layer request                         20.5 msec    0.2 %
Waiting for event (poll)                              5.0 msec   43.1 %
Application requested delay                           5.0 msec   36.5 %
Waiting for TTY data                                  5.0 msec   14.2 %
Page fault                                            4.9 msec    3.0 %
Waiting for data on unix socket                       4.0 msec    0.1 %
Reading from a pipe                                   3.4 msec    0.0 %
Userspace lock contention                             1.6 msec    0.0 %

Process hddtemp (3760)
put_device elv_insert blk_plug_device default_wake   29.3 msec   51.1 %
  blk_rq_bio_prep blk_rq_append_bio blk_rq_map_user sg_io scs
Creating block layer request                         20.5 msec   48.9 %
reiserfs/0 udevd kpsmoused kgameportd mdadm dhclient3 mdadm
dirmngr lighttpd syslogd acpid cupsd dbus-daemon hddtemp
LatencyTOP version 0.3 (C) 2008 Intel Corporation
Cause                                                Maximum     Percentage
Waiting for TTY data                                  5.0 msec   13.5 %
Waiting for event (poll)                              5.0 msec   44.2 %
Application requested delay                           5.0 msec   36.7 %
synchronous write                                     5.0 msec    1.9 %
Waiting for data on unix socket                       4.9 msec    0.1 %
Page fault                                            4.9 msec    2.4 %
Writing buffer to disk (synchronous)                  4.8 msec    0.0 %
Reading from a pipe                                   3.7 msec    0.1 %
Closing a file                                        1.9 msec    0.1 %

Process hald-addon-stor (4645)
SCSI cdrom ioctl                                      1.1 msec   17.7 %
Checking for media change                             0.6 msec   77.9 %
Executing raw SCSI command                            0.2 msec    4.3 %
Best regards,
--Edwin
Török Edwin wrote:
> Hi Arjan, Ingo,
> [for minor latencytop userspace-specific issues, see the end of this mail]
>
> * if I compile with CONFIG_SCHED_DEBUG, I always see a 30-40 msec latency
> from the scheduler (Scheduler: waiting for cpu).
> There is also a constant ~5% user, ~2% sys CPU usage on an idle system,
> regardless of whether the latencytop sysctl is on or off.
> Is this normal? (is the overhead really 40 msec?)
I'm seeing something similar; I would not rule out that this is actual
scheduler behavior ;( At least it seems consistent.
I also posted a patch (to lkml yesterday) that tracks both the total and
the count of scheduler latencies, to also show the cumulative effect of
these latencies.
> I was also seeing an unusually high number of context switches (as shown
> by vmstat). I usually have 400-800 with non-patched kernels (while
> running mplayer too), but I was getting steadily over 1100 with the
> patch (on an idle system).
That's weird; there's nothing blocking in the entire patch at all
(the only lock is a spinlock).
The performance aspect... collecting the data isn't cheap (which is why
it's behind a sysctl); I still plan to look at optimizing it, but it
won't ever be free.
> * if I compile without CONFIG_SCHED_DEBUG, I no longer get *any* latency
> from the scheduler, even if I run multi-threaded programs, etc. Is this
> to be expected? (i.e. is this feature only available when
> CONFIG_SCHED_DEBUG is enabled?)
afaik yes.
> Some minor latencytop userspace issues:
> * percentages: a new feature (nice!), but the values seem all wrong
> (global percentages are always wrong; per-app percentages are usually OK,
> but see the second example below)
Note that the percentages are the share each entry had of the total sum of
latencies. The maximum is only loosely related to that: say you have 1
latency of 100 ms in "foo" but 900 latencies of 1 ms in "bar"; "foo" will
show 10% and "bar" will show 90%.
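(In that example the total accumulated latency is 1 x 100 ms + 900 x 1 ms
= 1000 ms, so "foo" accounts for 100/1000 = 10% even though its single
event has the larger maximum.)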
> * I miss the Average latency column. If it is too costly to keep track
> of an overall average, can we have a last-N-seconds average?
It's not costly to calculate; it's the screen space versus the value of
the information :(
> * unknown reasons show a backtrace, but the backtrace doesn't have
> enough room on screen
Still working on that; you can pass the --unknown option to dump these so
that I can add them to the translation table.
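For example, something like "latencytop --unknown > unknown.txt"
(assuming here that the dump is written to stdout).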
Arjan van de Ven wrote:
Török Edwin wrote:
>> Is this normal? (is the overhead really 40 msec?)
>
> I'm seeing something similar; I would not rule out that this is actual
> scheduler behavior ;(
> At least it seems consistent.
OK, it is good that we are seeing the same behavior.
> I also posted a patch (to lkml yesterday) that tracks both the total and
> the count of scheduler latencies, to also show the cumulative effect of
> these latencies.
Found it: http://lkml.org/lkml/2008/1/21/223.
I'll try it in the next few days. Do you want me to report results? [and
especially, should I Cc: Ingo Molnar on the report?]
>
>> I was also seeing an unusually high number of context switches (as shown
>> by vmstat). I usually have 400-800 with non-patched kernels (while
>> running mplayer too), but I was getting steadily over 1100 with the
>> patch (on an idle system).
>
> That's weird; there's nothing blocking in the entire patch at all
> (the only lock is a spinlock).
Maybe the overhead of latencytop is triggering other behavior in the
scheduler? Just a guess.
Are there any tests I can run to see why there are more context switches?
>
> The performance aspect... collecting the data isn't cheap (which is why
> it's behind a sysctl); I still plan to look at optimizing it, but it
> won't ever be free.
Yes, I understand that. Is there a way latencytop could track its own
overhead? I suppose it would lead to more accurate results
(not that there is anything wrong with the current ones).
>
>> * if I compile without CONFIG_SCHED_DEBUG, I no longer get *any* latency
>> from the scheduler, even if I run multi-threaded programs, etc. Is this
>> to be expected? (i.e. is this feature only available when
>> CONFIG_SCHED_DEBUG is enabled?)
>
> afaik yes.
Ok.
>> Some minor latencytop userspace issues:
>> * percentages: a new feature (nice!), but the values seem all wrong
>> (global percentages are always wrong; per-app percentages are usually OK,
>> but see the second example below)
>
> Note that the percentages are the share each entry had of the total sum
> of latencies.
> The maximum is only loosely related to that: say you have 1 latency of
> 100 ms in "foo" but 900 latencies of 1 ms in "bar"; "foo" will show 10%
> and "bar" will show 90%.
Thanks, that explains it. It looks like a good substitute for the average
column; this is the functionality I was missing when that column went away.
Still, I wouldn't mind seeing the average column too (maybe activated via
a hotkey, or shown only if enough screen space is available?).
>
>> * I miss the Average latency column. If it is too costly to keep track
>> of an overall average, can we have a last-N-seconds average?
>
> It's not costly to calculate; it's the screen space versus the value
> of the information :(
If I stretch my terminal window there's room for 2 or 3 more columns :)
>
>> * unknown reasons show a backtrace, but the backtrace doesn't have
>> enough room on screen
>
> Still working on that; you can pass the --unknown option to dump these
> so that I can add them to the translation table.
I gathered these while writing this reply:
Unknown: put_device elv_insert blk_plug_device default_wake_function
blk_execute_rq blk_rq_bio_prep blk_rq_append_bio blk_rq_map_user sg_io
scsi_cmd_ioctl ip_queue_xmit tcp_transmit_skb
Unknown: md_thread autoremove_wake_function md_thread md_thread kthread
child_rip kthread child_rip
Unknown: kswapd autoremove_wake_function kswapd kswapd kthread child_rip
kthread child_rip
Best regards,
--Edwin
Török Edwin wrote:
>
>> The performance aspect... collecting the data isn't cheap (which is why
>> it's behind a sysctl); I still plan to look at optimizing it, but it
>> won't ever be free.
>>
>
> Yes, I understand that. Is there a way latencytop could track its own
> overhead? I suppose it would lead to more accurate results
> (not that there is anything wrong with the current ones).
>
The latencytop userspace tool only shows latencies > 0.1 msec, so
capturing backtraces for latencies < 0.1 msec could be avoided.
If I apply the patch below, then enabling latencytop doesn't freeze X
when running the "10 threads doing usleep(1) in an infinite loop" test.
Still, I don't want to lose track of the latencies we didn't collect
backtraces for, so I added a special "untraced" category, reported as the
first line in /proc/latency_stats. If needed, instead of hardcoding the
threshold, it could be made a sysctl, or set via writing to
/proc/latency_stats,...
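(In the grep output below, the three numbers on the untraced line are, in
order, the count of untraced latencies, their accumulated time, and the
maximum, the latter two in microseconds; this matches the seq_printf order
in the patch.)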
While I am running the test-program:
$ grep untraced /proc/latency_stats
4875605 5120414 49 untraced
On an idle system:
$ grep untraced /proc/latency_stats
532 3287 47 untraced
$ grep untraced /proc/latency_stats
853 5778 47 untraced
$ grep untraced /proc/latency_stats
950 6788 47 untraced
$ grep untraced /proc/latency_stats
1343 9977 49 untraced
$ grep untraced /proc/latency_stats
1448 11075 49 untraced
Best regards,
--Edwin
---
latencytop.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
--- /tmp/linux-2.6.24-rc8/kernel/latencytop.c	2008-01-24 11:27:32.727487585 +0200
+++ kernel/latencytop.c	2008-01-24 10:42:25.000000000 +0200
@@ -24,8 +24,11 @@
 static DEFINE_SPINLOCK(latency_lock);
 
 #define MAXLR 128
+/* the userspace tool only displays latencies above 0.1 msec, so
+ * don't collect backtraces for latencies below 0.05 msec (50 usecs) */
+#define LATENCY_TRACE_THRESHOLD 50
 static struct latency_record latency_record[MAXLR];
-
+static struct latency_record untraced;
 int latencytop_enabled;
 
 void clear_all_latency_tracing(struct task_struct *p)
@@ -47,6 +50,7 @@
 
 	spin_lock_irqsave(&latency_lock, flags);
 	memset(&latency_record, 0, sizeof(latency_record));
+	memset(&untraced, 0, sizeof(untraced));
 	spin_unlock_irqrestore(&latency_lock, flags);
 }
 
@@ -124,6 +128,15 @@
 	if (inter && usecs > 5000)
 		return;
 
+	if (usecs < LATENCY_TRACE_THRESHOLD) {
+		/* don't collect a stacktrace for very low latencies */
+		untraced.time += usecs;
+		if (usecs > untraced.max)
+			untraced.max = usecs;
+		untraced.count++;
+		return;
+	}
+
 	memset(&lat, 0, sizeof(lat));
 	lat.count = 1;
 	lat.time = usecs;
@@ -177,6 +190,10 @@
 
 	seq_puts(m, "Latency Top version : v0.1\n");
+	seq_printf(m, "%i %li %li untraced\n",
+		untraced.count,
+		untraced.time,
+		untraced.max);
 	for (i = 0; i < MAXLR; i++) {
 		if (latency_record[i].backtrace[0]) {
 			int q;
Török Edwin wrote:
>
> The latencytop userspace tool only shows latencies > 0.1 msec, so
> capturing backtraces for latencies < 0.1 msec could be avoided.
> If I apply the patch below, then enabling latencytop doesn't freeze X
> when running the "10 threads doing usleep(1) in an infinite loop" test.
OK, I like the idea; I would propose, though, making the 0.1 msec a sysctl
or something, so that people who really care about latencies lower than
that can just set it lower.
(and the tool can then read the value and use it)
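A rough sketch of what that could look like (untested and purely
illustrative; the variable and sysctl names below are made up). An entry
along these lines could go into kern_table[] in kernel/sysctl.c, next to
the existing latencytop enable switch:

/* tunable trace threshold in microseconds (0.05 msec default);
 * latencies below this are accounted but get no backtrace */
int latencytop_trace_threshold = 50;

{
	.ctl_name	= CTL_UNNUMBERED,
	.procname	= "latencytop_trace_threshold",
	.data		= &latencytop_trace_threshold,
	.maxlen		= sizeof(int),
	.mode		= 0644,
	.proc_handler	= &proc_dointvec,
},

The check in the patch would then compare against
latencytop_trace_threshold instead of the #define, and the tool could
read the value back from /proc/sys/kernel/latencytop_trace_threshold.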
> Still, I don't want to lose track of the latencies we didn't collect
> backtraces for, so I added a special "untraced" category, reported as the
> first line in /proc/latency_stats. If needed, instead of hardcoding the
> threshold, it could be made a sysctl, or set via writing to
> /proc/latency_stats,...
Yeah, for the total it makes sense; I'll do the sysctl thing for the
threshold and integrate this idea as well. Thanks a lot!