Hello,
my name is Mario Holbe.
Among others, I'm running a 2-CPU i386 system with a current
uptime of 348 days.
Since around 250 days of uptime, we have been seeing kernel
statistics overflows, which cause several problems in the HZ
calculations of the procps utilities:
$ uptime
Unknown HZ value! (28) Assume 0.
14:49:20 up 348 days, 33 min, ...
$ cat /proc/stat
cpu 259658341 5066163 336423925 1117681229
cpu0 129832988 2535697 168074057 2706455735
cpu1 129825353 2530466 168349868 2706192790
As you can see, the idle ('unused') jiffies counter in the cpu
summary line overflows, since 2706455735 + 2706192790 > 2^32.
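To make the wrap visible, here is a small user-space C sketch (not kernel
code) that adds the two per-CPU idle counters from the /proc/stat output
above, once with a 32 bit and once with a 64 bit accumulator. All function
names are made up for illustration. The guess_hz() helper rests on an
assumption: that procps estimates HZ roughly as total jiffies divided by
uptime and by the number of CPUs, which would explain the "28" in the
warning.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Sum per-CPU counters with a 32 bit accumulator:
 * unsigned arithmetic silently wraps modulo 2^32. */
uint32_t sum32(const uint32_t *v, int n)
{
    uint32_t s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same sum with a 64 bit accumulator, which does not wrap here. */
uint64_t sum64(const uint32_t *v, int n)
{
    uint64_t s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* ASSUMPTION (not taken from the procps source):
 * HZ guess = total jiffies / uptime in seconds / number of CPUs. */
unsigned guess_hz(uint64_t total_jiffies, uint64_t uptime_s, unsigned ncpus)
{
    return (unsigned)((double)total_jiffies / (double)uptime_s / ncpus);
}

/* Print both sums and both HZ guesses for the numbers in this mail. */
void print_wrap_demo(void)
{
    const uint32_t idle[2] = { 2706455735u, 2706192790u };
    /* user + nice + system from the "cpu" summary line */
    const uint64_t busy = 259658341ull + 5066163ull + 336423925ull;
    /* 348 days, 33 min, as reported by uptime */
    const uint64_t uptime_s = 348ull * 86400 + 33 * 60;

    printf("32 bit idle sum: %" PRIu32 "\n", sum32(idle, 2));
    printf("64 bit idle sum: %" PRIu64 "\n", sum64(idle, 2));
    printf("HZ guess, wrapped:   %u\n",
           guess_hz(busy + sum32(idle, 2), uptime_s, 2));
    printf("HZ guess, unwrapped: %u\n",
           guess_hz(busy + sum64(idle, 2), uptime_s, 2));
}
```

The 32 bit sum comes out as 1117681229, which is exactly the idle value in
the "cpu" line above, and under the stated assumption the wrapped total
yields an HZ guess of 28, while the unwrapped total lands close to 100.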
Apart from the general problem that jiffies and the kernel_stat
components are 32 bit on i386 platforms:
sched.h:extern unsigned long volatile jiffies;
kernel_stat.h:struct kernel_stat {
kernel_stat.h: unsigned int per_cpu_user[NR_CPUS],
kernel_stat.h: per_cpu_nice[NR_CPUS],
kernel_stat.h: per_cpu_system[NR_CPUS];
... and therefore overflow after around 500 days, on SMP
systems the summary calculations in procfs:
proc_misc.c: unsigned int sum = 0, user = 0, nice = 0, system = 0;
... already overflow after around 500 / NR_CPUS days.
That means after around 250 days on a 2-CPU system, after
around 125 days on a 4-CPU system, and so on.
[Yes, I know, in reality it's some 2^(32 - round_up(log2(NR_CPUS)))
calculation :)]
This results in, at least, permanent warnings from the procps
utilities.
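The day counts above follow from simple arithmetic. The user-space C
sketch below uses a hypothetical function name and assumes HZ = 100, as on
i386, and, as the worst case, that every CPU is idle nearly all the time,
so that the idle sum grows by HZ * NR_CPUS jiffies per second.

```c
#include <stdint.h>

/* Days until an unsigned 32 bit counter wraps if it grows by
 * hz * ncpus ticks per second.
 * ASSUMPTION: hz = 100 on i386; ncpus = 1 models a single
 * jiffies-like counter, ncpus > 1 the summed idle counter. */
double days_until_wrap32(unsigned hz, unsigned ncpus)
{
    const double ticks = 4294967296.0;                  /* 2^32 */
    const double per_day = (double)hz * ncpus * 86400.0;
    return ticks / per_day;
}
```

With hz = 100 this gives roughly 497 days for one counter, roughly 248
days for the sum over two CPUs and roughly 124 days for four.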
Since it should not be a big problem to fix this, and to
at least reduce the problem back to the 500-day
jiffies overflow problem, I'd suggest doing so.
Needless to say, 64 bit jiffies and statistics on
all platforms would be great :)
Yes, I know the performance implications of this
for 32 bit platforms.
Btw... Could anybody please explain to me what problems to
expect on a jiffies overflow? Would a kernel possibly
survive this at all, and if so, what are the chances? :)
PS: Please CC: me in replies, because I'm not on the list.
regards,
Mario
--
We are the Bore. Resistance is futile. You will be bored.
On Fri, 2002-11-15 at 14:22, Mario 'BitKoenig' Holbe wrote:
> Btw... Could anybody please explain to me what problems to
> expect on a jiffies overflow? Would a kernel possibly
> survive this at all, and if so, what are the chances? :)
The kernel uses the time_before/time_after macros, which know about the
wrapping of time, so it should all go ok.
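For readers who have not seen these macros: the idea is to compare two
counter values by the sign of their difference rather than directly. The
following is a user-space sketch of that idea with a fixed 32 bit width,
not a copy of the kernel headers; the names wrap_after and wrap_before are
made up here. It only gives the right answer while the two values are less
than 2^31 ticks apart.

```c
#include <stdint.h>

/* True if tick value a is later than b, even if the 32 bit
 * counter wrapped between the two readings. The unsigned
 * difference is reinterpreted as signed, so a small "negative"
 * distance means that a lies ahead of b. */
int wrap_after(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) < 0;
}

/* True if tick value a is earlier than b. */
int wrap_before(uint32_t a, uint32_t b)
{
    return wrap_after(b, a);
}
```

For example, wrap_after(5, 0xFFFFFFF0u) is true, because 5 was read shortly
after the wrap, whereas the plain comparison 5 > 0xFFFFFFF0u is false.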
On Fri, 15 Nov 2002, Mario 'BitKoenig' Holbe wrote:
[system idle time will overflow after about 497/NR_CPUS days]
>
> Since it should not be a big problem to fix this, and to
> at least reduce the problem back to the 500-day
> jiffies overflow problem, I'd suggest doing so.
>
> Needless to say, 64 bit jiffies and statistics on
> all platforms would be great :)
2.5 has 64 bit jiffies (but not (yet?) 64 bit statistics).
A patch for 2.4 that fixes the overflow in proc_stat and also
introduces 64 bit jiffies is at
http://www.physik3.uni-rostock.de/tim/kernel/2.4/jiffies64-20.patch.gz
>
> Btw... Could anybody please explain to me what problems to
> expect on a jiffies overflow? Would a kernel possibly
> survive this at all, and if so, what are the chances? :)
"ps" will report processes started before the jiffies wrap as being
started in the future, but this won't do any harm.
Tim
> "ps" will report processes started before the jiffies wrap
> as being started in the future, but this won't do any harm.
Sure it does harm. You might kill the wrong process,
like so:
pkill -n foo # kill the newest foo
The CPU usage stats in "top" will also be horribly messed up.