2008-07-13 20:07:28

by Rafael Almeida

Subject: /proc/data information

Hello,

I'm interested in knowing how the cpu data from /proc/stat is gathered.
Following my way from this function:

http://lxr.linux.no/linux+v2.6.25.10/fs/proc/proc_misc.c#L459

I've figured that the time is probably gathered using those
account_*_time on sched.c. I'm not sure where the times are read from,
though.

Anyhow, I thought I'd do a little test. I expected that if I added
together all the values on the cpu line of /proc/stat, waited 1 second
and added together all the values again, then their difference would be
a constant value. That is, the value should be the same any time I would
repeat the experiment. My reasoning was that those values accounted for
some unit of time and that the amount of time units between seconds
should always be the same. I expected a small variation due to the
kernel not always being able to wake the process that was doing the
measurement in exactly one second.

What I noticed, though, was that sometimes there's a very big variation.
For this experiment I used the following bash code:

t=`head -n1 /proc/stat |
    awk '{ print $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 }'`
sleep 1
for (( i = 0; i < 1800; i = i + 1 )); do
    nt=`head -n1 /proc/stat |
        awk '{ print $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 }'`
    echo $(( $nt - $t )) >> /tmp/values
    t=$nt
    sleep 1
done

The file with all the values can be reached at:
http://homepages.dcc.ufmg.br/~rafaelc/values

The mean of the numbers in the values file was: 109.85
The standard deviation was: 29.47
The maximum was: 589
The minimum was: 99
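
These figures can be reproduced straight from the values file with awk; a sketch, assuming one integer per line and the population standard deviation (`stats` is a made-up helper name):

```shell
# Compute mean, (population) standard deviation, max, and min of a
# one-number-per-line file such as /tmp/values.
stats() {
    awk '{
        sum += $1; sumsq += $1 * $1
        if (NR == 1 || $1 > max) max = $1
        if (NR == 1 || $1 < min) min = $1
    }
    END {
        mean = sum / NR
        printf "mean=%.2f sd=%.2f max=%d min=%d\n",
               mean, sqrt(sumsq / NR - mean * mean), max, min
    }' "$1"
}
```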

I found those values rather odd. They were gathered while I was using
the system as I normally do. There were 40 values above 200 and 7
above 300. I didn't expect such big values to show up. Why does that happen?

It looks like that when I run:

% dd if=/dev/zero of=/tmp/foo

things get more variable. I'm not sure why that would happen. Running a
CPU-bound process added some variation, but not nearly as much as the
dd command did (remember that my computer is rather old).

A little info about my system (Debian etch):
$ uname -a
Linux gaz 2.6.18-6-686 #1 SMP Fri Jun 6 22:22:11 UTC 2008 i686 GNU/Linux
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 898.087
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat
pse36 mmx fxsr sse up
bogomips : 1797.57

$

[]'s
Rafael


2008-07-14 16:39:16

by Andi Kleen

Subject: Re: /proc/data information

"Rafael C. de Almeida" <[email protected]> writes:

> I'm interested in knowing how the cpu data from /proc/stat is gathered.
> Following my way from this function:
>
> http://lxr.linux.no/linux+v2.6.25.10/fs/proc/proc_misc.c#L459
>
> I've figured that the time is probably gathered using those
> account_*_time on sched.c. I'm not sure where the times are read from,
> though.

They are normally (some architectures do it differently, to cope with
virtualized environments) sampled by a regular timer interrupt, which
runs HZ times per second on each CPU. A common value for HZ is 250
(a 4ms interval), but you can compile with others too.
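
A quick way to check the compiled-in HZ on a given machine; a sketch, where `tick_ms` and `kernel_hz` are made-up helper names and the config path is the usual Debian location (other distros may differ):

```shell
# Tick interval in milliseconds implied by a given HZ value.
tick_ms() {
    awk -v hz="$1" 'BEGIN { printf "%.1f ms\n", 1000 / hz }'
}

# CONFIG_HZ of the running kernel, if the distro ships the config here.
kernel_hz() {
    sed -n 's/^CONFIG_HZ=//p' "/boot/config-$(uname -r)" 2>/dev/null
}

tick_ms 250   # 250 is just an example HZ value
```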

I suspect the effects you're seeing all come from sampling error.
The interval is also not fully stable because the kernel sometimes
disables interrupts and that will delay the timer interrupt of course.
How often this happens depends on the workload.

Then there are architectures like s390 which do "microstate accounting":
they instead update the accounting on every kernel entry/exit and every
interrupt. That can be more accurate, but it is also more costly.

-Andi

2008-07-14 19:47:44

by Rafael Almeida

Subject: Re: /proc/data information

Andi Kleen wrote:
> "Rafael C. de Almeida" <[email protected]> writes:
>
>> I'm interested in knowing how the cpu data from /proc/stat is gathered.
>> Following my way from this function:
>>
>> http://lxr.linux.no/linux+v2.6.25.10/fs/proc/proc_misc.c#L459
>>
>> I've figured that the time is probably gathered using those
>> account_*_time on sched.c. I'm not sure where the times are read from,
>> though.
>
> They are normally (some architectures do it differently to cope with
> virtualized environments) sampled by a regular timer interrupt, which
> runs HZ times per second on each CPU. A common value for HZ is 250
> (a 4ms interval), but you can compile with others too.

How can it always sample regularly like that? The only way I can think
of doing the accounting is in an event-based manner. That is, when a
process is given a CPU you start counting; when you remove it you stop
the counter, and that would be your user CPU time; then you'd start
counting for the system. Accounting for I/O time and the other CPU times
would happen in the same manner.

Now, I took a look at ``void scheduler_tick(void)'' (from sched.c). I
think it's the function that gets called periodically. All I can see it
doing with regard to the clock is updating its value; it doesn't seem to
account for idle, user, or system time.

> I suspect the effects you're seeing all come from sampling error.
> The interval is also not fully stable because the kernel sometimes
> disables interrupts and that will delay the timer interrupt of course.
> How often this happens depends on the workload.

I didn't think about the kernel disabling interrupts. But I'm not sure
that's the main issue in my experiment; after all, that would make me
get smaller values rather than big ones, no? I mean, on an idle system
each second has 100 samples, but what I observed was that sometimes I
get as many as 500 samples in one second. If disabled interrupts were
the issue, I'd expect values much smaller than 100.
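
A note on units that may explain that "100 samples" figure: the cpu line in /proc/stat is reported in ticks of USER_HZ (sysconf's CLK_TCK), which is 100 on Linux regardless of the kernel's internal HZ. A one-liner to confirm on a given box:

```shell
# USER_HZ (CLK_TCK) is the unit /proc/stat's cpu counters use; on
# Linux it is 100 even when the kernel's internal HZ is 250 or 1000.
getconf CLK_TCK
```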

I've noticed that the absolute error doesn't grow (and thus becomes
relatively smaller) when I sleep for longer. Sampling every 10 seconds
usually gives me 1000 ticks, topping out at about 1500 (which is much
better than getting 100 ticks and peaking at 500). I wonder if the error
I'm seeing has more to do with the system not honoring the sleep
interval very precisely.

> Then there are architectures like s390 who do "microstate accounting":
> they keep track instead on every kernel entry/exit and every interrupt.
> That can be more accurate, but is also more costly.

My interest here is in the x86 architecture; I don't suppose I can turn
on microstate accounting there, can I?
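
As a first step, one can at least check which accounting-related options a kernel was built with; a sketch, where `acct_opts` is a made-up helper and the config path is the usual Debian location (other distros may ship /proc/config.gz instead):

```shell
# Print timer/accounting-related options from a kernel build config.
acct_opts() {
    grep -E 'VIRT_CPU_ACCOUNTING|^CONFIG_HZ' "$1"
}

acct_opts "/boot/config-$(uname -r)" 2>/dev/null || true
```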