Starting with kernel 2.6.9 the process start time is set wrongly for
processes that get started early in the boot process. Below is a dump from
my 'ps' command. Note the start time for processes 1-12. After process 12
the start time is set right.
Jerome
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  1372  500 ?        S    20:59   0:00 init [3]
root         2  0.0  0.0     0    0 ?        SN   20:59   0:00 [ksoftirqd/0]
root         3  0.0  0.0     0    0 ?        S<   20:59   0:00 [events/0]
root         4  0.0  0.0     0    0 ?        S<   20:59   0:00 [khelper]
root         5  0.0  0.0     0    0 ?        S<   20:59   0:00 [kblockd/0]
root         6  0.0  0.0     0    0 ?        S    20:59   0:00 [pdflush]
root         7  0.0  0.0     0    0 ?        S    20:59   0:00 [pdflush]
root         9  0.0  0.0     0    0 ?        S<   20:59   0:00 [aio/0]
root         8  0.0  0.0     0    0 ?        S    20:59   0:00 [kswapd0]
root        10  0.0  0.0     0    0 ?        S    20:59   0:00 [kseriod]
root        11  0.0  0.0     0    0 ?        S    20:59   0:00 [scsi_eh_0]
root        12  0.0  0.0     0    0 ?        S    20:59   0:00 [ahc_dv_0]
root        13  0.0  0.0     0    0 ?        S    19:48   0:00 [scsi_eh_1]
root        14  0.0  0.0     0    0 ?        S    19:48   0:00 [ahc_dv_1]
root        15  0.0  0.0     0    0 ?        S    19:48   0:00 [scsi_eh_2]
root        16  0.0  0.0     0    0 ?        S    19:48   0:00 [ahc_dv_2]
root        17  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        43  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        44  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        45  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        46  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        47  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root        48  0.0  0.0     0    0 ?        S    19:49   0:00 [kjournald]
root       122  0.0  0.2  1420  552 ?        Ss   19:49   0:00 /sbin/syslogd -m 0
root       124  0.0  0.1  1376  452 ?        Ss   19:49   0:00 /sbin/klogd
root       131  0.0  0.3  1640  776 ?        Ss   19:49   0:00 /sbin/apcupsd
root       139  0.0  0.9  2444 2444 ?        SLs  19:49   0:00 /usr/bin/ntpd
ldap       148  0.0  1.8 50084 4696 ?        Ssl  19:49   0:00 /usr/sbin/slapd -4 -u ldap -h ldap:/// ldapi:///
nscd       153  0.0  0.6 10208 1640 ?        Ssl  19:49   0:00 /usr/sbin/nscd
root       162  0.0  0.5  3156 1392 ?        Ss   19:49   0:00 /usr/sbin/sshd
root       168  0.0  0.9  5396 2436 ?        Ss   19:49   0:00 sendmail: accepting connections
smmsp      173  0.0  0.8  5176 2144 ?        Ss   19:49   0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
root       177  0.0  0.2  1372  564 ?        Ss   19:49   0:00 /usr/sbin/cron
On Tue, 2004-10-19 at 11:21, Jerome Borsboom wrote:
> Starting with kernel 2.6.9 the process start time is set wrongly for
> processes that get started early in the boot process. Below is a dump from
> my 'ps' command. Note the start time for processes 1-12. After process 12
> the start time is set right.
How reproducible is this? Are the correct and incorrect time values
always off by the same amount?
Are you running NTP? I'm curious if you are changing your system time
during boot.
thanks
-john
On Tue, 19 Oct 2004, john stultz wrote:
> On Tue, 2004-10-19 at 11:21, Jerome Borsboom wrote:
> > Starting with kernel 2.6.9 the process start time is set wrongly for
> > processes that get started early in the boot process. Below is a dump from
> > my 'ps' command. Note the start time for processes 1-12. After process 12
> > the start time is set right.
>
> How reproducible is this? Are the correct and incorrect time values
> always off by the same amount?
>
> Are you running NTP? I'm curious if you are changing your system time
> during boot.
I'd bet that some process early in the boot adjusts your system time.
Then this is expected behavior. This is why I would have preferred the
simple back-out patch for the boot times problem.
I'm sorry I fell off the net for so long and didn't stand up for the
simpler change in this case. Oh well.
I'll probably supply a back-out patch for -mm then, after wading through
my multi-megabyte email backlog (sorry John, still need to read your time
keeping proposal and all its discussion).
Tim
On Tue, 2004-10-19 at 17:42, Tim Schmielau wrote:
> On Tue, 19 Oct 2004, john stultz wrote:
>
> > On Tue, 2004-10-19 at 11:21, Jerome Borsboom wrote:
> > > Starting with kernel 2.6.9 the process start time is set wrongly for
> > > processes that get started early in the boot process. Below is a dump from
> > > my 'ps' command. Note the start time for processes 1-12. After process 12
> > > the start time is set right.
> >
> > How reproducible is this? Are the correct and incorrect time values
> > always off by the same amount?
> >
> > Are you running NTP? I'm curious if you are changing your system time
> > during boot.
>
> I'd bet that some process early in the boot adjusts your system time.
He claims that's not the case (you weren't CC'ed on his reply, but it's
on lkml). He believes the time changes before NTP starts up. Might be
something else, but I'm not sure.
> Then this is expected behavior. This is why I would have preferred the
> simple back-out patch for the boot times problem.
>
> I'm sorry I fell off the net for so long and didn't stand up for the
> simpler change in this case. Oh well.
>
> I'll probably supply a back-out patch for -mm then, after wading through
> my multi-megabyte email backlog (sorry John, still need to read your time
> keeping proposal and all its discussion).
I've begun to agree with you about this issue. It seems that until we
can catch every use of jiffies for time, doing one by one is going to
cause consistency problems. So I'd support the full backout of the
do_posix_clock_monotonic_gettime changes to the proc interface.
George, would you protest this?
As for the timeofday overhaul, I've had zero time to work on it
recently. I hate that I dropped code and then went missing for weeks.
I'll have to see if I can get a few cycles at home to sync up my current
tree and send it out.
thanks
-john
On Tue, 19 Oct 2004, john stultz wrote:
> As for the timeofday overhaul, I've had zero time to work on it
> recently. I hate that I dropped code and then went missing for weeks.
> I'll have to see if I can get a few cycles at home to sync up my current
> tree and send it out.
I still haven't looked at your code and its discussion. From what I
remember, I liked your proposal very much. It's surely where we want to
end up someday. But from the above mail it strikes me that we just don't
have enough manpower to get there all at once, so we should have a plan
for the time code to gradually evolve into what we finally want. I think
we could do it in the following steps:
 1. Sync up jiffies with the monotonic clock, very much like we
    already handle lost ticks. This would immediately remove the
    hassles with incompatible time sources.
    Judging from the jiffies wrap experience, there are probably
    some drivers which need fixing (mostly because they wait until
    jiffies==something), but these are bugs already right now
    in the case of lost ticks.
 2. Decouple jiffies from the actual interrupt counter. We could
    then e.g. set HZ to 10000, also increasing the resolution of
    timers, without increasing the interrupt frequency.
    We'd then need to identify the places where this might lead to
    overflows and promote them to use jiffies_64 instead of jiffies
    (where this hasn't been done already).
 3. Increase HZ all the way up to 1e9. jiffies_64 would then be the
    same as your plain 64 bit nanoseconds value.
    This would require an optimization to the timer code to be able
    to increment jiffies in steps larger than 1.
Thoughts?
On Tue, 2004-10-19 at 23:05, Tim Schmielau wrote:
> I think we could do it in the following steps:
>
> 1. Sync up jiffies with the monotonic clock,...
> 2. Decouple jiffies from the actual interrupt counter...
> 3. Increase HZ all the way up to 1e9....
> Thoughts?
Yes, for long periods of idle, I'd like to see the periodic clock tick
disabled entirely. Clock ticks cause the hardware to exit power-saving
idle states.
The current design with HZ=1000 gives us 1ms = 1000usec between clock
ticks. But some platforms take nearly that long just to enter/exit low
power states; which means that on Linux the hardware pays a long idle
state exit latency (performance hit) but gets little or no power savings
from the time it resides in that idle state.
thanks,
-Len
Len Brown wrote:
> On Tue, 2004-10-19 at 23:05, Tim Schmielau wrote:
>
>>I think we could do it in the following steps:
>>
>> 1. Sync up jiffies with the monotonic clock,...
>> 2. Decouple jiffies from the actual interrupt counter...
>> 3. Increase HZ all the way up to 1e9....
Before we do any of the above, I think we need to stop and ponder just what a
"jiffie" is. Currently it is, by default (or historically) the "basic tick" of
the system clock. On top of this a lot of interpolation code has been "grafted"
to allow the system to resolve time to finer levels, i.e. to the nanosecond.
But none of this interpolation code actually changes the tick, i.e. the
interrupt still happens at the same periodic rate.
As the "basic tick", it is used to do a lot of accounting and scheduling house
keeping AND as a driver of the system timers.
So, by this definition, it REQUIRES a system interrupt.
I have built a "tick less" system and have evidence from that that such systems
are over load prone. The faster the context switch rate, the more accounting
needs to be done. On the otherhand, the ticked system has flat accounting
overhead WRT load.
Regardless of what definitions we settle on, the system needs an interrupt
source to drive the system timers, and, as I indicate above, the accounting and
scheduling stuff. It is a MUST that these interrupts occur at the required
times or the system timers will be off. This is why we have a jiffies value
that is "rather odd" in the x86 today.
George
>
>
>>Thoughts?
>
>
> Yes, for long periods of idle, I'd like to see the periodic clock tick
> disabled entirely. Clock ticks cause the hardware to exit power-saving
> idle states.
>
> The current design with HZ=1000 gives us 1ms = 1000usec between clock
> ticks. But some platforms take nearly that long just to enter/exit low
> power states; which means that on Linux the hardware pays a long idle
> state exit latency (performance hit) but gets little or no power savings
> from the time it resides in that idle state.
>
> thanks,
> -Len
>
>
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Len Brown wrote:
> On Tue, 2004-10-19 at 23:05, Tim Schmielau wrote:
>
>>I think we could do it in the following steps:
>>
>> 1. Sync up jiffies with the monotonic clock,...
>> 2. Decouple jiffies from the actual interrupt counter...
>> 3. Increase HZ all the way up to 1e9....
>
>
>>Thoughts?
>
>
> Yes, for long periods of idle, I'd like to see the periodic clock tick
> disabled entirely. Clock ticks cause the hardware to exit power-saving
> idle states.
>
> The current design with HZ=1000 gives us 1ms = 1000usec between clock
> ticks. But some platforms take nearly that long just to enter/exit low
> power states; which means that on Linux the hardware pays a long idle
> state exit latency (performance hit) but gets little or no power savings
> from the time it resides in that idle state.
I (and MontaVista) will be expanding on the VST patches. There are, currently,
two levels of VST. VST-I, when entering the idle state (task), looks ahead in the
timer list, finds the next event, and shuts down the "tick" until that time. An
interrupt resets things, be it from the end of the time counter or another source.
VST-II adds a callback list to idle entry and exit. This allows one to add
code to change (or even remove) timers on idle entry and restore them on exit.
We are doing this work to support deeply embedded applications that oftentimes
run on small batteries (think cell phone if you like).
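To make the VST-I mechanism concrete, here is a rough sketch in C (every function name below is invented for illustration; this is not the actual VST or MontaVista code):
    static void vst_enter_idle(void)
    {
            unsigned long next = next_pending_timer_jiffy(); /* earliest expiry in the timer list */
            if (time_after(next, jiffies + 1))
                    program_oneshot_tick(next);      /* skip the intervening periodic ticks */
            halt_until_interrupt();                  /* any interrupt, timer or otherwise, wakes us */
    }
    static void vst_exit_idle(void)
    {
            jiffies_64 += ticks_slept_through();     /* catch jiffies back up before anyone reads it */
            resume_periodic_tick();
    }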
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
On Wed, 20 Oct 2004, George Anzinger wrote:
> Len Brown wrote:
>> On Tue, 2004-10-19 at 23:05, Tim Schmielau wrote:
>>
>>> I think we could do it in the following steps:
>>>
>>> 1. Sync up jiffies with the monotonic clock,...
>>> 2. Decouple jiffies from the actual interrupt counter...
>>> 3. Increase HZ all the way up to 1e9....
>
> Before we do any of the above, I think we need to stop and ponder just what a
> "jiffie" is. Currently it is, by default (or historically) the "basic tick"
> of the system clock. On top of this a lot of interpolation code has been
> "grafted" to allow the system to resolve time to finer levels, i.e. to the
> nanosecond. But none of this interpolation code actually changes the tick,
> i.e. the interrupt still happens at the same periodic rate.
>
> As the "basic tick", it is used to do a lot of accounting and scheduling
> housekeeping AND as a driver of the system timers.
>
> So, by this definition, it REQUIRES a system interrupt.
>
> I have built a "tick less" system and have evidence from that that such
> systems are over load prone. The faster the context switch rate, the more
> accounting needs to be done. On the otherhand, the ticked system has flat
> accounting overhead WRT load.
>
> Regardless of what definitions we settle on, the system needs an interrupt
> source to drive the system timers, and, as I indicate above, the accounting
> and scheduling stuff. It is a MUST that these interrupts occur at the
> required times or the system timers will be off. This is why we have a
> jiffies value that is "rather odd" in the x86 today.
>
> George
>
>
You need that hardware interrupt for more than time-keeping.
Without a hardware-interrupt, to force a new time-slice,
for(;;)
;
... would allow a user to grab the CPU forever ...
So, getting rid of the hardware interrupt can't be done.
Also, much effort has gone into obtaining high resolution
timing without any high resolution hardware to back it
up. This means that users can get numbers like 987,654
microseconds and the last 654 are as valuable as teats
on a bull. With a HZ timer tick, you get 1/HZ resolution
pure and simple. The rest of the "interpolation" is just
guess-work which leads to lots of problems, especially
when one attempts to read a spinning down-count value
from a hardware device accessed off some ports!
If the ix86 CMOS timer were used you could get better
accuracy than at present, but accuracy is something one
can accommodate with automatic adjustment of time,
traceable to some appropriate standard.
The top-level schedule code could contain some flag that
says: "are we in a power-down mode?" If so, it could
execute minimal in-cache code, i.e.:
for(;;)
{
hlt(); // Sleep until next tick
if(mode != power_down)
schedule();
}
The timer-tick ISR or any other ISR wakes us up from halt.
This keeps the system sleeping, not wasting power grabbing
code/data from RAM and crunching some numbers that are
not going to be used.
>>
>>
>>> Thoughts?
>>
>>
>> Yes, for long periods of idle, I'd like to see the periodic clock tick
>> disabled entirely. Clock ticks cause the hardware to exit power-saving
>> idle states.
>>
>> The current design with HZ=1000 gives us 1ms = 1000usec between clock
>> ticks. But some platforms take nearly that long just to enter/exit low
>> power states; which means that on Linux the hardware pays a long idle
>> state exit latency (performance hit) but gets little or no power savings
>> from the time it resides in that idle state.
>>
>> thanks,
>> -Len
>>
>>
>
> --
> George Anzinger [email protected]
> High-res-timers: http://sourceforge.net/projects/high-res-timers/
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 GrumpyMips).
98.36% of all statistics are fiction.
On Wed, 2004-10-20 at 07:51, George Anzinger wrote:
> john stultz wrote:
> > I've begun to agree with you about this issue. It seems that until we
> > can catch every use of jiffies for time, doing one by one is going to
> > cause consistency problems. So I'd support the full backout of the
> > do_posix_clock_monotonic_gettime changes to the proc interface.
> >
> > George, would you protest this?
>
> It seems to me that if we do that we will stop making any changes at all. I.e.
> we will not see the rest of the "jiffies for time" code, as it will not "hurt"
> any more.
Sorry, not sure I followed that. Could you explain further?
> Also, the original change was made for a reason...
Right, but I thought it was you who made the original change, and I
don't recall you answering what that reason was? I wouldn't want the
code ripped out if it was fixing an actual problem, so that's why I'm
asking.
At the moment, I'd like the idea I think Tim is suggesting, where we fix
time so we have a stable base, then we decouple xtime and jiffies from
the timer interrupt and instead emulate them from the time code.
So rather than every tick incrementing jiffies, instead jiffies is set
equal to (monotonic_clock()*HZ)/NSEC_PER_SEC.
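As a minimal sketch of that formula (this assumes monotonic_clock() returns nanoseconds since boot, as the i386 timer code of this era does, and that NSEC_PER_SEC is a multiple of HZ):
    static inline u64 jiffies_from_time(void)
    {
            u64 ns = monotonic_clock();          /* nanoseconds since boot */
            /* equivalent to (ns * HZ) / NSEC_PER_SEC, but dividing first
               avoids overflowing 64 bits on long uptimes */
            do_div(ns, NSEC_PER_SEC / HZ);       /* do_div() leaves the quotient in ns */
            return ns;
    }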
Thoughts?
-john
On Tue, 2004-10-19 at 20:05, Tim Schmielau wrote:
> On Tue, 19 Oct 2004, john stultz wrote:
>
> > As for the timeofday overhaul, I've had zero time to work on it
> > recently. I hate that I dropped code and then went missing for weeks.
> > I'll have to see if I can get a few cycles at home to sync up my current
> > tree and send it out.
>
> I still haven't looked at your code and its discussion. From what I
> remember, I liked your proposal very much. It's surely where we want to
> end up someday. But from the above mail it strikes me that we just don't
> have enough manpower to get there all at once, so we should have a plan
> for the time code to gradually evolve into what we finally want. I think
> we could do it in the following steps:
>
> 1. Sync up jiffies with the monotonic clock...
>
> 2. Decouple jiffies from the actual interrupt counter...
>
> 3. Increase HZ all the way up to 1e9....
> Thoughts?
They all sound good. I like the notion of basing jiffies off of system
time, rather than interrupt counts. However, I'm a little cautious of
changing the meaning of jiffies too drastically.
Right now jiffies has two core meanings:
1. Count of the number of timer ticks that have passed.
2. Accurate system uptime, measured in units of 1/HZ
(Let me know if I forgot any others)
The problem being, neither of those meanings is 100% true.
#1 isn't true because when we lose timer ticks, we try to compensate
for them (i386 specifically). But at the same time #2 isn't true because
the timer interrupts don't necessarily run at exactly HZ (again, i386
specifically).
Basically due to our hardware constraints, we need to break one of these
two assumptions. The problem is which do we choose?
Do we base jiffies off of monotonic_clock(), guaranteeing #2 and
possibly breaking anyone who is assuming #1? Or do we change all users
of jiffies for time to use monotonic_clock, guaranteeing #1, which will
require quite a bit of work.
And which choice makes it harder for folks to create tickless systems?
It's a tough call.
On top of that, we still have the issue that the current interpolation
used in the time of day subsystem is broken (in my opinion), and we need
to fix that before we can have a reliable monotonic_clock.
The joke being of course that I'll need to set my /etc/ntp/ntp.drift
file to 500 to find the time to work on any of this. And really, anyone
who really found that funny needs to go home.
thanks
-john
On Wed, 2004-10-20 at 13:09, Lee Revell wrote:
> On Wed, 2004-10-20 at 03:47, Len Brown wrote:
> > The current design with HZ=1000 gives us 1ms = 1000usec between
> > clock ticks. But some platforms take nearly that long just
> > to enter/exit low power states; which means that on Linux
> > the hardware pays a long idle state exit latency
> > (performance hit) but gets little or no power savings
> > from the time it resides in that idle state.
>
> My testing shows that the timer interrupt runs for about 21 usec.
> That's 2.1% of its time just running the timer ISR! No wonder this
> causes PM issues, 2.1% cpu load is not exactly an idle machine. This
> is a 600Mhz C3, so on a slower embedded system this might be 5%.
>
> So, any solution that would allow high res timers with Hz = 100 would
> be welcome.
5% residency in the clock tick handler is likely more of a problem when
we're _not_ idle -- a 5% performance hit. When we're idle we've got
nothing better to do with the processor than run these instructions for
5% of the time and run no instructions 95% of the time -- so tick
handler residency isn't the problem in idle, tick frequency is the
problem.
When an interrupt occurs, the hardware needs to ramp up its voltages,
resume its clocks, and do all the stuff it needs to do to get out of the
power-saving state to run the code that services the interrupt. This
"exit latency" can take a long time. On a volume Centrino system today
it is up to 185usec. On other hardware it is as high as 1000 usec.
Time spent in this exit latency is a double penalty -- we're not saving
power and we're delaying before the processor starts executing
instructions -- so we want to pay this price only when necessary.
-Len
john stultz wrote:
> On Wed, 2004-10-20 at 07:51, George Anzinger wrote:
>
>>john stultz wrote:
>>
>>>I've begun to agree with you about this issue. It seems that until we
>>>can catch every use of jiffies for time, doing one by one is going to
>>>cause consistency problems. So I'd support the full backout of the
>>>do_posix_clock_monotonic_gettime changes to the proc interface.
>>>
>>>George, would you protest this?
>>
>>It seems to me that if we do that we will stop making any changes at all. I.e.
>>we will not see the rest of the "jiffies for time" code, as it will not "hurt"
>>any more.
>
>
> Sorry, not sure I followed that. Could you explain further?
If we rip out the code folks will stop sending in bug reports on it. Simple as
that.
>
>
>>Also, the original change was made for a reason...
>
>
> Right, but I thought it was you who made the original change, and I
> don't recall you answering what that reason was? I wouldn't want the
> code ripped out if it was fixing an actual problem, so that's why I'm
> asking.
As I recall the problem was that uptime was not matching the elapsed wall clock.
This was because it was jiffies based and the 1/HZ assumption was made about
what a jiffie is. When jiffies became ~1/HZ instead of =1/HZ we started all the
"good times". And, this was done because 1/HZ could not be obtained with the
PIT interrupt source with enough accuracy to satisfy the NTP code.
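For the i386 case George describes, the arithmetic works out roughly as follows (standard i8254 numbers, not taken from his mail):
    PIT input clock        = 1193182 Hz
    LATCH for HZ=1000      = (1193182 + 500) / 1000 = 1193
    real tick length       = 1193 / 1193182 s  ~=  999.848 usec  (about 1000.15 ticks/sec)
so an uptime computed as jiffies * (1/HZ) runs fast by roughly 0.015%, on the order of 13 seconds per day, which is the "~1/HZ instead of =1/HZ" mismatch referred to above.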
>
> At the moment, I'd like the idea I think Tim is suggesting, where we fix
> time so we have a stable base, then we decouple xtime and jiffies from
> the timer interrupt and instead emulate them from the time code.
The can of worms here is decoupling jiffies from the timer interrupt. Jiffies
is (like it or not) the unit of measure used for timers, and these _require_ an
interrupt AND it should be consistently within a few tens of usec of when the
jiffie changes.
>
> So rather than every tick incrementing jiffies, instead jiffies is set
> equal to (monotonic_clock()*HZ)/NSEC_PER_SEC.
As mentioned by me (a long time ago), this assumes you have a better source for
the clock than the interrupt. I would argue that on the x86 (which I admit is
really deficient) the best long term clock is, in fact, the PIT interrupt. The
_best_ clock on the x86, IMHO, is one that used the PIT interrupt as the gold
standard. Then one smooths this to eliminate interrupt latency issues and lost
ticks using the TSC. The pm_timer is as good as the PIT but suffers from
access time issues.
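A very rough sketch of the lost-tick half of that idea (simplified; tsc_cycles_per_tick stands for a precomputed calibration value, and this is not the actual i386 lost-tick code):
    static void pit_tick_smoothed(void)              /* called from the PIT interrupt */
    {
            static u64 last_tsc;
            u64 delta = get_cycles() - last_tsc;     /* raw TSC cycles since the last accepted tick */
            do_div(delta, tsc_cycles_per_tick);      /* whole tick intervals that really elapsed */
            if (!delta)
                    delta = 1;                       /* early or jittered interrupt: count one tick */
            last_tsc += delta * tsc_cycles_per_tick; /* keep phase, ignore interrupt latency */
            jiffies_64 += delta;                     /* lost ticks get made up here */
    }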
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
On Wed, 2004-10-20 at 16:52, George Anzinger wrote:
> john stultz wrote:
> > On Wed, 2004-10-20 at 07:51, George Anzinger wrote:
> >
> >>john stultz wrote:
> >>
> >>>I've begun to agree with you about this issue. It seems that until we
> >>>can catch every use of jiffies for time, doing one by one is going to
> >>>cause consistency problems. So I'd support the full backout of the
> >>>do_posix_clock_monotonic_gettime changes to the proc interface.
> >>>
> >>>George, would you protest this?
> >>
> >>It seems to me that if we do that we will stop making any changes at all. I.e.
> >>we will not see the rest of the "jiffies for time" code, as it will not "hurt"
> >>any more.
> >
> >
> > Sorry, not sure I followed that. Could you explain further?
>
> If we rip out the code folks will stop sending in bug reports on it. Simple as
> that.
So you feel that we're moving in the right direction, it's just that it's
going to take a few passes before everything smooths out? Thus it's just
a continuation of the effort?
Tim? Is this OK with you, or do you feel the immediate inconsistencies and
bug reports aren't worth the effort?
> > So rather than every tick incrementing jiffies, instead jiffies is set
> > equal to (monotonic_clock()*HZ)/NSEC_PER_SEC.
>
> As mentioned by me (a long time ago), this assumes you have a better source for
> the clock than the interrupt. I would argue that on the x86 (which I admit is
> really deficient) the best long term clock is, in fact, the PIT interrupt. The
> _best_ clock on the x86, IMHO, is one that used the PIT interrupt as the gold
> standard. Then one smooths this to eliminate interrupt latency issues and lost
> ticks using the TSC. The pm_timer is as good as the PIT but suffers from
> access time issues.
Well, assuming the PIT is programmed to a value it can actually run at
accurately, you might be right.
The only problem is I've started to arrive at the notion that
interpolation between multiple problematic timesources is just a rat's
nest. When you can't trust timer interrupts to arrive and you can't
trust the TSC to run at the right frequency, there's no way to figure
out who's right. We already have the lost-tick compensation code, but
we still get time inconsistencies. Now maybe I'm just too dim-witted to
make it work, but the more I look at it, the more corner cases appear
and the uglier the code gets.
I say pick a timesource you can trust on your machine and stick to it.
NTP is there to correct for drift, so just use it.
-john
john stultz wrote:
>
>>>So rather than every tick incrementing jiffies, instead jiffies is set
>>>equal to (monotonic_clock()*HZ)/NSEC_PER_SEC.
>>
>>As mentioned by me (a long time ago), this assumes you have a better source for
>>the clock than the interrupt. I would argue that on the x86 (which I admit is
>>really deficient) the best long term clock is, in fact, the PIT interrupt. The
>>_best_ clock on the x86, IMHO, is one that used the PIT interrupt as the gold
>>standard. Then one smooths this to eliminate interrupt latency issues and lost
>>ticks using the TSC. The pm_timer is as good as the PIT but suffers from
>>access time issues.
>
>
> Well, assuming the PIT is programmed to a value it can actually run at
> accurately, you might be right.
>
> The only problem is I've started to arrive at the notion that
> interpolation between multiple problematic timesources is just a rat's
> nest. When you can't trust timer interrupts to arrive and you can't
> trust the TSC to run at the right frequency, there's no way to figure
> out who's right. We already have the lost-tick compensation code, but
> we still get time inconsistencies. Now maybe I'm just too dim witted to
> make it work, but the more I look at it, the more corner cases appear
> and the uglier the code gets.
>
> I say pick a timesource you can trust on your machine and stick to it.
> NTP is there to correct for drift, so just use it.
>
Let's try to remember that the x86 WRT time is a real pile of used hay. Even the
"fixes" the hardware folks are spinning out reflect a real lack of
understanding. A pm_timer that you cannot trust is doubly bad, but then they
thought it was part of the powerdown code so... The new timer, which we may see
on real machines some day, is still in I/O space (read REALLY SLOW TO ACCESS)
for starters.
Back in my days at HP we (HP) talked with Intel and, to some extent, caused a
change in the IA64. That machine, and a lot of other platforms, have decent
timekeeping hardware. All we have to do is wait for the x86 to die :).
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
> From: Richard B. Johnson
>
> You need that hardware interrupt for more than time-keeping.
> Without a hardware-interrupt, to force a new time-slice,
>
> for(;;)
> ;
>
> ... would allow a user to grab the CPU forever ...
But you can also schedule, before switching to the new task,
a local interrupt on the running processor to mark the end
of the timeslice. When you enter the scheduler, you just need
to remove that; the devil is in the details, but it should be possible
to do in a way that doesn't take too much overhead.
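A hedged sketch of that scheme (the functions and the slice_ns field below are invented, not an existing kernel API): arm a per-CPU one-shot timer when a task is switched in, and tear it down when the task is switched out early:
    static void slice_timer_arm(struct task_struct *next)
    {
            /* fires at the end of next's slice and forces a trip through schedule() */
            oneshot_arm(this_cpu_slice_timer(), now_ns() + next->slice_ns);
    }
    static void slice_timer_cancel(struct task_struct *prev)
    {
            /* prev blocked or yielded before its slice ran out, so the interrupt is no
               longer needed; this cancel is part of the per-switch cost discussed below */
            oneshot_cancel(this_cpu_slice_timer());
    }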
Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)
On Wed, 2004-10-20 at 03:47, Len Brown wrote:
> The current design with HZ=1000 gives us 1ms = 1000usec between clock
> ticks. But some platforms take nearly that long just to enter/exit low
> power states; which means that on Linux the hardware pays a long idle
> state exit latency (performance hit) but gets little or no power savings
> from the time it resides in that idle state.
My testing shows that the timer interrupt runs for about 21 usec.
That's 2.1% of its time just running the timer ISR! No wonder this
causes PM issues, 2.1% cpu load is not exactly an idle machine. This is
a 600MHz C3, so on a slower embedded system this might be 5%.
So, any solution that would allow high res timers with Hz = 100 would be
welcome.
Lee
john stultz wrote:
> On Tue, 2004-10-19 at 17:42, Tim Schmielau wrote:
>
>>On Tue, 19 Oct 2004, john stultz wrote:
>>
>>
>>>On Tue, 2004-10-19 at 11:21, Jerome Borsboom wrote:
>>>
>>>>Starting with kernel 2.6.9 the process start time is set wrongly for
>>>>processes that get started early in the boot process. Below is a dump from
>>>>my 'ps' command. Note the start time for processes 1-12. After process 12
>>>>the start time is set right.
>>>
>>>How reproducible is this? Are the correct and incorrect time values
>>>always off by the same amount?
>>>
>>>Are you running NTP? I'm curious if you are changing your system time
>>>during boot.
>>
>>I'd bet that some process early in the boot adjusts your system time.
>
>
> He claims that's not the case (you weren't CC'ed on his reply, but it's
> on lkml). He believes the time changes before NTP starts up. Might be
> something else, but I'm not sure.
>
>
>>Then this is expected behavior. This is why I would have preferred the
>>simple back-out patch for the boot times problem.
>>
>>I'm sorry I fell off the net for so long and didn't stand up for the
>>simpler change in this case. Oh well.
>>
>>I'll probably supply a back-out patch for -mm then, after wading through
>>my multi-megabyte email backlog (sorry John, still need to read your time
>>keeping proposal and all its discussion).
>
>
> I've begun to agree with you about this issue. It seems that until we
> can catch every use of jiffies for time, doing one by one is going to
> cause consistency problems. So I'd support the full backout of the
> do_posix_clock_monotonic_gettime changes to the proc interface.
>
> George, would you protest this?
It seems to me that if we do that we will stop making any changes at all. I.e.
we will not see the rest of the "jiffies for time" code, as it will not "hurt"
any more.
Also, the original change was made for a reason...
-g
>
> As for the timeofday overhaul, I've had zero time to work on it
> recently. I hate that I dropped code and then went missing for weeks.
> I'll have to see if I can get a few cycles at home to sync up my current
> tree and send it out.
>
> thanks
> -john
>
>
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Perez-Gonzalez, Inaky wrote:
>>From: Richard B. Johnson
>>
>>You need that hardware interrupt for more than time-keeping.
>>Without a hardware-interrupt, to force a new time-slice,
>>
>> for(;;)
>> ;
>>
>>... would allow a user to grab the CPU forever ...
>
>
> But you can also schedule, before switching to the new task,
> a local interrupt on the running processor to mark the end
> of the timeslice. When you enter the scheduler, you just need
> to remove that; devil is in the details, but it should be possible
> to do in a way that doesn't take too much overhead.
Well, that is part of the accounting overhead that increases with context switch
rate. You also need to include the time it takes to figure out which of the
time limits is closest (run time limit, profile time, slice time, etc). Then,
you also need to remove the timer when switching away. No, it is not a lot, but
it is way more than the nothing we do when we can turn it all over to the
periodic tick. The choice is load sensitive overhead vs flat overhead.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
> From: George Anzinger [mailto:[email protected]]
>
> Perez-Gonzalez, Inaky wrote:
>
> > But you can also schedule, before switching to the new task,
> > a local interrupt on the running processor to mark the end
> > of the timeslice. When you enter the scheduler, you just need
> > to remove that; devil is in the details, but it should be possible
> > to do in a way that doesn't take too much overhead.
>
> Well, that is part of the accounting overhead that increases with context switch
> rate. You also need to include the time it takes to figure out which of the
> time limits is closest (run time limit, profile time, slice time, etc). Then,
I know these are specific examples, but:
- profile time is a periodic thingie, so if you have it, forget about
having a tickless system. Periodic interrupt for this guy, get it
out of the equation.
- slice time vs runtime limit. I don't remember what is the granularity of
the runtime limit, but it could be expressed in slice terms. If not,
we are talking (along with any other times) of min() operations, which
are just a few cycles each [granted, they add up].
> you also need to remove the timer when switching away. No, it is not a lot, but
> it is way more than the nothing we do when we can turn it all over to the
> periodic tick. The choice is load sensitive overhead vs flat overhead.
This is just talking out of my ass, but I guess that for each invocation
they will have more or less the same overhead in execution time, let's
say T. For the periodic tick, the total overhead (in a second) is T*HZ;
with tickless, it'd be T*number_of_context_switches_per_second, right?
Now, the ugly case would be if number_of_context_switches_per_second > HZ.
In HZ = 100, this could be happening, but in HZ=1000, in a single CPU
...well, that would be TOO weird [of course, a real-time app with a
1ms period would do that, but it'd require at least an HZ of 10000 to
work more or less ok and we'd be below the watermark].
So in most cases, and given the assumptions, we'd end up winning,
because number_of_context..., even if variable, is going to be bound
on the upper side by HZ.
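Restating that back-of-the-envelope model (the roughly-equal per-event cost T is an assumption of the argument, not a measurement):
    ticked:     overhead/sec ~= T * HZ                          (flat, independent of load)
    tickless:   overhead/sec ~= T * context_switches_per_sec    (grows with load)
    crossover:  tickless loses once context_switches_per_sec exceeds HZ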
Well, you know way more than I do about this, so here is the question:
what is the error in that line of reasoning?
Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)
Perez-Gonzalez, Inaky wrote:
> Now, the ugly case would be if number_of_context_switches_per_second > HZ.
> In HZ = 100, this could be happening, but in HZ=1000, in a single CPU
> ...well, that would be TOO weird [of course, a real-time app with a
> 1ms period would do that, but it'd require at least an HZ of 10000 to
> work more or less ok and we'd be below the watermark].
It's easy to have >>1000 context switches per second on a server. Consider a
web server that receives a network packet, issues a request to a database, hands
some work off to a thread so the main app doesn't block, then sends a response.
That could be a half dozen context switches per packet. If you have 20000
packets/sec coming in....
Chris
Perez-Gonzalez, Inaky wrote:
>>From: George Anzinger [mailto:[email protected]]
>>
>>Perez-Gonzalez, Inaky wrote:
>>
>>
>>>But you can also schedule, before switching to the new task,
>>>a local interrupt on the running processor to mark the end
>>>of the timeslice. When you enter the scheduler, you just need
>>>to remove that; devil is in the details, but it should be possible
>>>to do in a way that doesn't take too much overhead.
>>
>>Well, that is part of the accounting overhead that increases with context switch
>>rate. You also need to include the time it takes to figure out which of the
>>time limits is closest (run time limit, profile time, slice time, etc). Then,
>
>
> I know these are specific examples, but:
>
> - profile time is a periodic thingie, so if you have it, forget about
> having a tickless system. Periodic interrupt for this guy, get it
> out of the equation.
Not really. It is only active if the task is running. At the very least the
scheduler needs to check to see if it is on and, if so, set up a timer for it.
>
> - slice time vs runtime limit. I don't remember what is the granularity of
> the runtime limit, but it could be expressed in slice terms. If not,
> we are talking (along with any other times) of min() operations, which
> are just a few cycles each [granted, they add up].
The main issue here is accumulating the run time which is accounting work that
needs to happen on context switch (out in this case).
>
>
>>you also need to remove the timer when switching away. No, it is not a lot, but
>>it is way more than the nothing we do when we can turn it all over to the
>>periodic tick. The choice is load sensitive overhead vs flat overhead.
>
>
> This is just talking out of my ass, but I guess that for each invocation
> they will have more or less the same overhead in execution time, let's
> say T. For the periodic tick, the total overhead (in a second) is T*HZ;
> with tickless, it'd be T*number_of_context_switches_per_second, right?
>
> Now, the ugly case would be if number_of_context_switches_per_second > HZ.
> In HZ = 100, this could be happening, but in HZ=1000, in a single CPU
> ...well, that would be TOO weird [of course, a real-time app with a
> 1ms period would do that, but it'd require at least an HZ of 10000 to
> work more or less ok and we'd be below the watermark].
??? Better look again. Context switches can and do happen as often as every 10 or so
microseconds (depends a lot on the cpu speed). I admit this is with code that
is just trying to measure the context switch time, but, often the system will
change its mind just that fast.
>
> So in most cases, and given the assumptions, we'd end up winning,
> because number_of_context..., even if variable, is going to be bound
> on the upper side by HZ.
>
> Well, you know way more than I do about this, so here is the question:
> what is the error in that line of reasoning?
The expected number of context switches. In some real-world apps it gets rather
high. The crossover of your two curves _might_ be of interest to some (it is
rather low by my measurements, done with the tickless patch that is still on
sourceforge). On the other hand, where I come from, a system which has
increasing overhead with load is one that is going to overload. We are always
better off if we can figure a way to have fixed overhead.
As for the idle system ticks, I think the VST stuff we are working on is the
right answer.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
> From: George Anzinger [mailto:[email protected]]
>
> > This is just talking out of my ass, but I guess that for each invocation
> > they will have more or less the same overhead in execution time, let's
> > say T. For the periodic tick, the total overhead (in a second) is T*HZ;
> > with tickless, it'd be T*number_of_context_switches_per_second, right?
>
> ??? Better look again. Context switches can and do happen as often as 10 or so
> micro seconds (depends a lot on the cpu speed). I admit this is with code that
> is just trying to measure the context switch time, but, often the system will
> change its mind just that fast.
As I said, I was talking out of my ass [aka, I didn't know and was just
guesstimating for the heck of it], so I am happily proven wrong -- thanks to
Chris and you -- I guess I didn't take into account voluntary yielding of
the CPU by a task; I was thinking more of switches caused by a timer
making a task runnable, or a timeslice expiring, etc., which now are
more or less guided by the tick [and then of course, we have IRQs,
but that's another matter].
> ...
> sourceforge). On the other hand, where I come from, a system which has
> increasing overhead with load is one that is going to overload. We are always
> better off if we can figure a way to have fixed overhead.
>
> As for the idle system ticks, I think the VST stuff we are working on is the
> right answer.
Once my logic is proven wrong, then it makes full sense :]
Thanks for the heads up.
Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)
George Anzinger wrote:
> Well, that is part of the accounting overhead that increases with context
> switch rate. You also need to include the time it takes to figure out
> which of the time limits is closest (run time limit, profile time, slice
> time, etc). Then, you also need to remove the timer when switching
> away. No, it is not a lot, but it is way more than the nothing we do
> when we can turn it all over to the periodic tick. The choice is load
> sensitive overhead vs flat overhead.
It should be possible to be clever about this. Most processes don't use their
timeslice, so if we have a previous timer running, just keep track of how much
beyond that timer our timeslice will be. If we context switch before the timer
expiry, well and good. If the timer expires, set it for what's left of our
timeslice.
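A rough sketch of that lazy scheme (the types, fields, and helpers below are invented for illustration, except set_need_resched()):
    /* called when a task is switched in; cpu->armed is the absolute expiry of
       whatever one-shot timer is already programmed on this CPU, if any */
    void slice_start(struct cpu *cpu, u64 slice_end)
    {
            if (cpu->armed && cpu->armed <= slice_end) {
                    cpu->slice_overshoot = slice_end - cpu->armed;  /* reuse the earlier timer */
            } else {
                    program_oneshot(cpu, slice_end);                /* only touch hardware when needed */
                    cpu->slice_overshoot = 0;
            }
    }
    void oneshot_expired(struct cpu *cpu)
    {
            /* ...run whatever the expired timer was armed for... */
            if (cpu->slice_overshoot) {
                    program_oneshot(cpu, now_ns() + cpu->slice_overshoot); /* finish out the slice */
                    cpu->slice_overshoot = 0;
            } else
                    set_need_resched();              /* the running task's slice really is up */
    }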
Chris
>How reproducible is this? Are the correct and incorrect time values
>always off by the same amount?
>
>Are you running NTP? I'm curious if you are changing your system time
>during boot.
>
>thanks
>-john
At each boot, the time of the first processes seems to be off by 1 hour and
11 minutes. Another system shows the same symptoms but with different
values.
I am setting the time during boot with ntp, but the start time seems to
change from incorrect to correct before I even run ntp.
Jerome
Chris Friesen wrote:
> George Anzinger wrote:
>
>> Well, that is part of the accounting overhead that increases with
>> context switch rate. You also need to include the time it takes to
>> figure out which of the time limits is closest (run time limit, profile
>> time, slice time, etc). Then, you also need to remove the timer when
>> switching away. No, it is not a lot, but it is way more than the
>> nothing we do when we can turn it all over to the periodic tick. The
>> choice is load sensitive overhead vs flat overhead.
>
>
> It should be possible to be clever about this. Most processes don't use
> their timeslice, so if we have a previous timer running, just keep track
> of how much beyond that timer our timeslice will be. If we context
> switch before the timer expiry, well and good. If the timer expires,
> set it for what's left of our timeslice.
Me thinks that rather quickly devolves to a periodic tick.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
George Anzinger wrote:
> Chris Friesen wrote:
>> It should be possible to be clever about this. Most processes don't
>> use their timeslice, so if we have a previous timer running, just keep
>> track of how much beyond that timer our timeslice will be. If we
>> context switch before the timer expiry, well and good. If the timer
>> expires, set it for what's left of our timeslice.
>
>
> Me thinks that rather quickly devolves to a periodic tick.
In the busy case, yes. But on an idle system you get tickless behaviour.
It's still going to be load-sensitive, since you are doing additional work to
keep track of the timer/timeout values. But it saves work if reprogramming the
timer is time-consuming compared to simply reading it. On something like the
ppc, it probably doesn't buy you much since the decrementer is cheap to program.
Chris
On Tue, 19 Oct 2004, john stultz wrote:
> On Tue, 2004-10-19 at 11:21, Jerome Borsboom wrote:
> > Starting with kernel 2.6.9 the process start time is set wrongly for
> > processes that get started early in the boot process. Below is a dump from
> > my 'ps' command. Note the start time for processes 1-12. After process 12
> > the start time is set right.
>
> How reproducible is this? Are the correct and incorrect time values
> always off by the same amount?
If the problem is reproducible, does it go away with the following patch
against 2.6.9?
An untested patch against 2.6.9-mm1 is at
http://www.physik3.uni-rostock.de/tim/kernel/2.6/uptime-fix-09.patch
Tim
--- linux-2.6.9/fs/proc/array.c 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/fs/proc/array.c 2004-10-27 01:44:13.000000000 +0200
@@ -360,11 +360,7 @@ int proc_pid_stat(struct task_struct *ta
read_unlock(&tasklist_lock);
/* Temporary variable needed for gcc-2.96 */
- /* convert timespec -> nsec*/
- start_time = (unsigned long long)task->start_time.tv_sec * NSEC_PER_SEC
- + task->start_time.tv_nsec;
- /* convert nsec -> ticks */
- start_time = nsec_to_clock_t(start_time);
+ start_time = jiffies_64_to_clock_t(task->start_time - INITIAL_JIFFIES);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
--- linux-2.6.9/fs/proc/proc_misc.c 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/fs/proc/proc_misc.c 2004-10-27 01:44:23.000000000 +0200
@@ -133,19 +133,36 @@ static struct vmalloc_info get_vmalloc_i
static int uptime_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
- struct timespec uptime;
- struct timespec idle;
+ u64 uptime;
+ unsigned long uptime_remainder;
int len;
- u64 idle_jiffies = init_task.utime + init_task.stime;
- do_posix_clock_monotonic_gettime(&uptime);
- jiffies_to_timespec(idle_jiffies, &idle);
- len = sprintf(page,"%lu.%02lu %lu.%02lu\n",
- (unsigned long) uptime.tv_sec,
- (uptime.tv_nsec / (NSEC_PER_SEC / 100)),
- (unsigned long) idle.tv_sec,
- (idle.tv_nsec / (NSEC_PER_SEC / 100)));
+ uptime = get_jiffies_64() - INITIAL_JIFFIES;
+ uptime_remainder = (unsigned long) do_div(uptime, HZ);
+#if HZ!=100
+ {
+ u64 idle = init_task.utime + init_task.stime;
+ unsigned long idle_remainder;
+
+ idle_remainder = (unsigned long) do_div(idle, HZ);
+ len = sprintf(page,"%lu.%02lu %lu.%02lu\n",
+ (unsigned long) uptime,
+ (uptime_remainder * 100) / HZ,
+ (unsigned long) idle,
+ (idle_remainder * 100) / HZ);
+ }
+#else
+ {
+ unsigned long idle = init_task.utime + init_task.stime;
+
+ len = sprintf(page,"%lu.%02lu %lu.%02lu\n",
+ (unsigned long) uptime,
+ uptime_remainder,
+ idle / HZ,
+ idle % HZ);
+ }
+#endif
return proc_calc_metrics(page, start, off, count, eof, len);
}
--- linux-2.6.9/include/linux/acct.h 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/include/linux/acct.h 2004-10-27 01:44:13.000000000 +0200
@@ -172,22 +172,15 @@ static inline u32 jiffies_to_AHZ(unsigne
#endif
}
-static inline u64 nsec_to_AHZ(u64 x)
+static inline u64 jiffies_64_to_AHZ(u64 x)
{
-#if (NSEC_PER_SEC % AHZ) == 0
- do_div(x, (NSEC_PER_SEC / AHZ));
-#elif (AHZ % 512) == 0
- x *= AHZ/512;
- do_div(x, (NSEC_PER_SEC / 512));
+#if (TICK_NSEC % (NSEC_PER_SEC / AHZ)) == 0
+#if HZ != AHZ
+ do_div(x, HZ / AHZ);
+#endif
#else
- /*
- * max relative error 5.7e-8 (1.8s per year) for AHZ <= 1024,
- * overflow after 64.99 years.
- * exact for AHZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
- */
- x *= 9;
- do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (AHZ/2))
- / AHZ));
+ x *= TICK_NSEC;
+ do_div(x, (NSEC_PER_SEC / AHZ));
#endif
return x;
}
--- linux-2.6.9/include/linux/sched.h 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/include/linux/sched.h 2004-10-27 01:44:13.000000000 +0200
@@ -508,7 +508,7 @@ struct task_struct {
struct timer_list real_timer;
unsigned long utime, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
- struct timespec start_time;
+ u64 start_time;
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt;
/* process credentials */
--- linux-2.6.9/include/linux/times.h 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/include/linux/times.h 2004-10-27 01:44:23.000000000 +0200
@@ -7,16 +7,11 @@
#include <asm/types.h>
#include <asm/param.h>
-static inline clock_t jiffies_to_clock_t(long x)
-{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
- return x / (HZ / USER_HZ);
+#if (HZ % USER_HZ)==0
+# define jiffies_to_clock_t(x) ((x) / (HZ / USER_HZ))
#else
- u64 tmp = (u64)x * TICK_NSEC;
- do_div(tmp, (NSEC_PER_SEC / USER_HZ));
- return (long)tmp;
+# define jiffies_to_clock_t(x) ((clock_t) jiffies_64_to_clock_t((u64) x))
#endif
-}
static inline unsigned long clock_t_to_jiffies(unsigned long x)
{
@@ -40,7 +35,7 @@ static inline unsigned long clock_t_to_j
static inline u64 jiffies_64_to_clock_t(u64 x)
{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
+#if (HZ % USER_HZ)==0
do_div(x, HZ / USER_HZ);
#else
/*
@@ -48,33 +43,13 @@ static inline u64 jiffies_64_to_clock_t(
* but even this doesn't overflow in hundreds of years
* in 64 bits, so..
*/
- x *= TICK_NSEC;
- do_div(x, (NSEC_PER_SEC / USER_HZ));
+ x *= USER_HZ;
+ do_div(x, HZ);
#endif
return x;
}
#endif
-static inline u64 nsec_to_clock_t(u64 x)
-{
-#if (NSEC_PER_SEC % USER_HZ) == 0
- do_div(x, (NSEC_PER_SEC / USER_HZ));
-#elif (USER_HZ % 512) == 0
- x *= USER_HZ/512;
- do_div(x, (NSEC_PER_SEC / 512));
-#else
- /*
- * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024,
- * overflow after 64.99 years.
- * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
- */
- x *= 9;
- do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (USER_HZ/2))
- / USER_HZ));
-#endif
- return x;
-}
-
struct tms {
clock_t tms_utime;
clock_t tms_stime;
--- linux-2.6.9/kernel/acct.c 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/kernel/acct.c 2004-10-27 01:44:13.000000000 +0200
@@ -384,8 +384,6 @@ static void do_acct_process(long exitcod
unsigned long vsize;
unsigned long flim;
u64 elapsed;
- u64 run_time;
- struct timespec uptime;
/*
* First check to see if there is enough free_space to continue
@@ -403,13 +401,7 @@ static void do_acct_process(long exitcod
ac.ac_version = ACCT_VERSION | ACCT_BYTEORDER;
strlcpy(ac.ac_comm, current->comm, sizeof(ac.ac_comm));
- /* calculate run_time in nsec*/
- do_posix_clock_monotonic_gettime(&uptime);
- run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
- run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC
- + current->start_time.tv_nsec;
- /* convert nsec -> AHZ */
- elapsed = nsec_to_AHZ(run_time);
+ elapsed = jiffies_64_to_AHZ(get_jiffies_64() - current->start_time);
#if ACCT_VERSION==3
ac.ac_etime = encode_float(elapsed);
#else
--- linux-2.6.9/kernel/fork.c 2004-10-27 00:04:58.000000000 +0200
+++ linux-2.6.9-uf/kernel/fork.c 2004-10-27 01:44:13.000000000 +0200
@@ -992,7 +992,7 @@ static task_t *copy_process(unsigned lon
p->utime = p->stime = 0;
p->lock_depth = -1; /* -1 = no lock */
- do_posix_clock_monotonic_gettime(&p->start_time);
+ p->start_time = get_jiffies_64();
p->security = NULL;
p->io_context = NULL;
p->io_wait = NULL;
--- linux-2.6.9/mm/oom_kill.c 2004-10-27 00:04:59.000000000 +0200
+++ linux-2.6.9-uf/mm/oom_kill.c 2004-10-27 01:44:13.000000000 +0200
@@ -26,7 +26,6 @@
/**
* oom_badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
- * @p: current uptime in seconds
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
@@ -42,7 +41,7 @@
* of least surprise ... (be careful when you change it)
*/
-static unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long badness(struct task_struct *p)
{
unsigned long points, cpu_time, run_time, s;
@@ -57,16 +56,12 @@ static unsigned long badness(struct task
points = p->mm->total_vm;
/*
- * CPU time is in tens of seconds and run time is in thousands
- * of seconds. There is no particular reason for this other than
- * that it turned out to work very well in practice.
+ * CPU time is in seconds and run time is in minutes. There is no
+ * particular reason for this other than that it turned out to work
+ * very well in practice.
*/
cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
-
- if (uptime >= p->start_time.tv_sec)
- run_time = (uptime - p->start_time.tv_sec) >> 10;
- else
- run_time = 0;
+ run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10);
s = int_sqrt(cpu_time);
if (s)
@@ -116,12 +111,10 @@ static struct task_struct * select_bad_p
unsigned long maxpoints = 0;
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
- struct timespec uptime;
- do_posix_clock_monotonic_gettime(&uptime);
do_each_thread(g, p)
if (p->pid) {
- unsigned long points = badness(p, uptime.tv_sec);
+ unsigned long points = badness(p);
if (points > maxpoints) {
chosen = p;
maxpoints = points;