2005-05-04 17:24:50

by George Anzinger

[permalink] [raw]
Subject: Re: Help with the high res timers

|First I want to express my joy that we are discussing these issues at some depth.

I also think we need a wider audience and so have added the kernel list to the
cc. For that reason, I will not trim this message, thus allowing them to "catch
up".

Nishanth Aravamudan wrote:
> On 03.05.2005 [16:15:04 -0700], George Anzinger wrote:
>
>>john stultz wrote:
>>
>>>On Tue, 2005-05-03 at 14:36 -0700, George Anzinger wrote:
>>>
>>>
>>>>At the moment I am trying to understand and recast the work John Stultz
>>>>is doing. I have some very real concerns here and there is a patch by
>>>>[email protected]<Nishanth Aravamudan> that, to me, looks like a royal
>>>>pain. I think we need to understand it better and voice our concerns in
>>>>a way that is constructive. Nish's patch is here:
>>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0504.3/1635.html
>
>
> I'm not on the HRT list, so I didn't get the original e-mail...keep me
> on the CC, please.
>
>
>>>Hey George,
>>> Hopefully our work isn't causing you too much trouble. I do
>>> understand
>>>the pain of having large conflicting works that try to achieve similar
>>>goals.
>>>
>>>
>>>
>>>>To summarize my concerns, the HRT code depends on correct and exact
>>>>registration of the timer interrupt with time of day. (We don't worry
>>>>about latency here.) In the latest x86 HRT patch we fine tune the system
>>>>clock to insure this alignment. This allows us to set up two stage
>>>>timers that use the run_timers code for the coarse time (to the time base
>>>>resolution) followed by a short high res hardware timer to get to the
>>>>higher resolution we want. If the coarse timer (i.e. add_timer,
>>>>run_timer and friends) do not register correctly, we will have the case
>>>>of a coarse timer expiring after the needed and requested time. Now,
>>>
>>>>from a coarse timer point of view, that may be fine, but it means we can
>>>
>>>>not depend on it for the high res first stage.
>>>
>>>
>>>
>>>The conceptual idea Nish is going after, is once we have a solid method
>>>to get a high-res timeofday (via the new monotonic_clock() function), we
>>>can use that instead of jiffies to expire soft-timers.
>>>
>>>This requires re-defining each timer-bucket entry length from a jiffy
>>>instead to a unit of time (Nish calls this a timerinterval unit). This
>>>timer-interval unit's length is flexible can be defined at compile time.
>>>Currently Nish is using ~1ms (1<<20 nanoseconds) as the interval length,
>>>but it could be redefined to be as desired to give higher-res soft-
>>>timers.
>>
>>The issue is not its length, but can we interrupt at its end. The key to
>>high res is being on time. If we want a 100usec resolution timer, we need
>>to have the low res timer hit its mark with something real close to this.
>
>
> Are you asking: Can we interrupt at the end of the timerinterval? That
> doesn't matter to me, so much, since I made the timerinerval length
> independent of the interrupt source. If you wanted to make sure they
> corresponded completely, you simply would need to change the
> nsecs_to_timerintervals() function appropriately. I was trying to stay
> somewhat efficient, especially since we're dealing with a 64-bit
> quantity, and thus used the shifting John mentioned above.

I think we are at too high a level to go into this in detail here, but part of
the HRT patch is a header called sc_math.h (Scaled math). The notion that you
need to choose and hard code the multiplier is addressed there. The only real
choice you need to make is the number of bits you want to shift. The more you
shift the more exact the answer is. For an example of choosing the shift bits
you may want to look at jiffies.h in the current tree.
>
> So, currently, my code does not provide for high-res timers,
> admittedly. It does enable some of the infrastructure that should make
> implementing high-res timing in mainline (turned on by tuning
> TIMERINTERVAL_BITS or setting a .config option, perhaps) more feasible,
> in my eyes. I think we have the same goal in mind -- make timing
> flexible and useful.
>
> To get 100 usec resolution with my system, you would need to make
> TIMERINTERVAL_BITS=16. Basically, this makes a timerinterval = 1
> nanosecond << 16, or 131.07 (131) microseconds. (Of course, you could
> not use shifting and just divide by 10. Might be slower, but is
> certainly feasible.) Admittedly, this is not perfect, but it's close.
> Now, this has nothing to do with the hardware. But, it defines the
> lowest interval, *regardless* of hardware, which the system will
> support. If you request a 100 microsecond delay, we will make it 131
> microseconds.

Leaving aside the precision issue (addressed by scaled math above), I don't
think this is the way to address high resolution. For the most part most user
and kernel applications are quite happy with 10 ms resolution and, I would
argue, we should go back to the 100 HZ value. IMNSHO high res should be asked
for on a request at a time basis. The POSIX standard provided a rather nice way
to do this in the CLOCKS & TIMERS section of the standard. In the kernel the
low res part of this is in kernel/posix-timers.c. To extend to high res all one
does is to define new clocks. In HRT we define two high res clocks
(CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR). From the user point of view, to get
high res, he has to used one of these clocks.

The trick is to muster the kernel resources to make the *_HR clock a reality.
In my HRT patches I have done this with a two phase approach. I simply add a
second, HR member to the timer structure. This allows the timer to carry the
extra bits needed to express the added resolution and also, if the bits are
zero, to indicate that it is just a low res timer. Over the life time of the
HRT patch I have worked with two different ways of handling the HR bits in the
timer list. In all cases, we stay with a periodic time tick and only incur
extra overhead when the timer is about to expire. We ask the "arch" code to
provide an interrupt source with the needed short term resolution. Since we
only use this timer at the end of the timer period it needs cover only a short
time (0 to 1 jiffie), but should have good resolution over this period. Also,
since it is a rather short time, we don't require it to have long term stability.

The point is, the standard, low res, timer code needs to be only minimally aware
of the HR bits. On the other hand, since the HR timer will be started on
expiration of the low res timer, we need that expiration to be aligned with the
clock in a stable, predictable long term way. (For what its worth, we have yet
to address NTP clock drifting in the HRT patch, so that is still a fly in this
web of time.)
>
> Now, the trick is then, to get the hardware to tick every 100
> microseconds. You have a lot more expertise in this area, and I would be
> interested to see what the actual options are. Presuming we are able to
> get a periodic interrupt at this rate, we could then do what I think HRT
> pretty much does, maintain the normal buckets for normal timers and then
> have a special HRT bucket. We could then be aware of when, in
> nanoseconds, the next HRT interrupt should be, set the hardware
> appropriately and be ok. If there are no HRT timers, then we'll never
> check that bucket, really (we could also just not configure in that
> option). I think it's feasible, regardless. There may be performance
> penalties, I agree. I'm working on figuring out how bad things get if we
> set TIMERINTERVAL_BITS=10, for instance (about every microsecond). If
> the system stays up, I'll be happy :)

I don't think we have hardware yet that can address 100usec ticks, nor, as I
argue above, do we need it. In point of fact, we found it necessary to address
the user who, using the POSIX standard timer interface asked for a repeating
timer of, say 1 usec (yes we "say" we support 1 usec resolution). Such a
request could easily consume the machine (I call these timer storms), causing it
never to be hear from again. The, I think, elegant solution to the timer storm
problem is to not restart the timer until the user picks up the prior
expiration. This dynamically adjusts the timer response to the amount of
machine available at the time.
>
>
>>>Now when the new code expires timers it does so against the timeofday
>>>subsystem's notion of time instead of jiffies. It simply goes through
>>>all the timer bucket entries between now and the last time we expired
>>>timers.
>>
>>This is not unlike what happens now. I would hope that the number of
>>buckets visited averages to to something real close to 1 per run_timers
>>entry.
>
>
> I agree, this is what we might be able to achieve by keeping high-res
> timers in their own bucket. That way, we could conceivably know when the
> next periodic interrupt is and when the next HRT interrupt is in
> absolute nanoseconds (based on do_monotonic_clock()).

For what its worth, we have such a clock in HRT. It is expressed in jiffies and
arch_cycles, but jiffies is what is used today as the "standard" for internal
system time. This gets into the area of just how to express time in areas where
you need to quickly get and compare the time. After a lot of though, I decided
to use arch_cycles rather than nanoseconds because arch_cycles are directly
readable from the hardware. Any other unit of time requires a conversion, and,
as you know, usually requires 64-bits as well. By staying with arch_cycles we
avoid "most" of the conversion overhead.
>
>
>>>Now an interesting point you bring up above is when to schedule timer
>>>interrupts. One could just have a fine-grained timer-interval unit and
>>>crank up HZ, but clearly we don't want to schedule interrupts to go off
>>>too frequently or the overhead won't be worth it. Instead to get high-
>>>res timers, the idea is to look at the timer list and schedule
>>>interrupts appropriately. Then we only take an interrupt when there is
>>>work to do, rather then at regular periodic intervals when there may or
>>>may not be anything to expire.
>>
>>This would suggest that all timers are high res. I don't think this makes
>>sense:
>>1) because most users just don't care that much about resolution and
>>high-res carries some overhead.
>>2) Not all platform hardware is able to handle high res.
>
>
> I agree, this is an issue which my TIMERINTERVAL_BITS=10 test should
> also reveal. But under the normal case where TIMERINTERVAL_BITS are 20,
> then there will never be an interrupt scheduled to go off quicker than
> 2^20 nanoseconds in the future (about a millisecond). Everything hinges
> around these bits :) (mostly because they define how we convert
> nanoseconds to timerintervals).
>
>
>>I think high res timers should only be used when the user asks for them.
>>This keeps the overhead under control.
>
>
> I agree -- the user asking for the, currently, would mean they set
> TIMERINTERVAL_BITS lower than normal and also set up the hardware to
> interrupt appropriately.

No, no, they ask by using a *_HR clock.
>
>
>>>Now from my understanding of your framework, this sounds similar to how
>>>your high-res hardware timer interrupt is used. The idea being that we
>>>use that for all soft-timers, instead of having two levels of high-res
>>>and coarse soft-timers.
>>>
>>>The interesting problem that Darren will be looking at is how to
>>>schedule interrupts accurately. This appears to be something you've
>>>already attacked, so your experience would be helpful. My idea being
>>>that the timer_interrupt code checks when it intended to fire and when
>>>it actually did fire, and that information could then be used to tune
>>>the interrupt scheduling code.
>>
>>A small, measurable latency is ok and need not be backed out by the
>>software. If you go this route you risk expiring a timer early and the
>>standard says bad things about this.
>
>
> I'm not sure what you mean about timers going off early. We round up on
> addition and round down on expiration to make sure that we don't get
> timers early. As long as do_monotonic_clock() is accurate, we should be
> ok. (Darren's part uses the timerintervals value to figure out when the
> next interrupt should be, similar to the existing
> next_timer_interrupt())

Your rounding up or down is not really different than what is done with jiffies
today...
>
>
>>What is missing in this is that the flag ship arch (x86) has piss poor
>>capability to schedule timer interrupts. I.e. there really is no commonly
>>available timer to generate interrupts at some arbitrary time. There is,
>>however, the PIT, which, if you don't mess with it, is very low overhead
>>but periodic.
>
>
> And generally, we are going to stay periodic, for now. Darren's patch is
> a further extension on mine, which requires a lot of testing.
>
>
>>Then there is the issue of what the time standard is. In the x86, the PIT
>>is it. The pm_timer is close, but the read problem adds so much overhead
>>as to make it hard to justify. Possibly the HPET does a better job, but it
>>is still an I/O access (read SLOW) and is not commonly available as yet.
>>
>>You also have to do something reasonable for the accounting subsystems.
>>Currently this is done by sampling the system at a periodic rate. What
>>ever is running gets charged with the last period. If you go non-periodic,
>>either you have to charge a variable period when the sample is taken or you
>>have to set up timers as part of context switching. This ladder is not
>>wise as the context switch overhead then gets larger. It is rather easy to
>>show that the accounting overhead in such a system with a modest load is
>>higher than in a periodic sampled system. This system is also charged with
>>doing the time slicing. With out a periodic tick, a timer needs to be set
>>up each context switch.
>
>
> We will always (at least in our current concept) have an idea of when,
> worst/best case, the next timer interrupt will be (a maximum, predefined
> interval). This would also define how long the system could go
> completely idle without expiring any timers.

I think we should leave the skipping of timer interrupts to VST code. The VST
code only needs to figure out when the next timer is to expire when the cpu is
idle, i.e. it has a time to spare to look up that timer.
>
>
>>In addition to all of this, there is the issue of how to organize the timer
>>list so that you can find the next timer. With the current organization
>>this is something you just don't want to do. On the other hand, I am
>>convinced that, for periodic timers, it is about as good as it gets.
>
>
> I think currently, we do something similar to next_timer_interrupt(),
> making sure we update the stored value whenever we add a new timer, if
> necessary. Except now, we will determine the next time to set the hw
> timer to go off every time we expire timers.

Finding the next timer is a high overhead operation....
>
> I know my proposal is a big change and does make things more difficult
> in the short-term for the HRT patch. But I think it is possible to work
> together and get a very reasonable system in place, which allows for
> both HRT and normal timers to work effectively. I may have made things
> worse in my response, but I think the discussion has to be had :)
>


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/


2005-05-04 17:45:57

by Chris Friesen

[permalink] [raw]
Subject: Re: Help with the high res timers

George Anzinger wrote:

> The, I think, elegant solution to the timer storm problem is to
> not restart the timer until the user picks up the prior expiration.
> This dynamically adjusts the timer response to the amount of machine
> available at the time.

The disadvantage is that you then lose accuracy since each timer
interval is increased by some random amount based on system scheduling.
What about some kind of ulimit-type thing to specify the minimum
recurring interval that can be specified? If root so specifies, you
could have 1usec interval timers and the system would hang. This is
conceptually no different than busy-looping in a SCHED_FIFO task.

Chris

2005-05-04 17:53:42

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: Help with the high res timers

On 04.05.2005 [11:44:56 -0600], Chris Friesen wrote:
> George Anzinger wrote:
>
> >The, I think, elegant solution to the timer storm problem is to
> >not restart the timer until the user picks up the prior expiration.
> >This dynamically adjusts the timer response to the amount of machine
> >available at the time.
>
> The disadvantage is that you then lose accuracy since each timer
> interval is increased by some random amount based on system scheduling.
> What about some kind of ulimit-type thing to specify the minimum
> recurring interval that can be specified? If root so specifies, you
> could have 1usec interval timers and the system would hang. This is
> conceptually no different than busy-looping in a SCHED_FIFO task.

If I understand your point correctly, I think this is achieved by
TIMERINTERVAL_BITS in my patch (not to claim my patch is function, but
conceptually). No matter what you actually request, the best you can do
is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
tick-rate and timerinterval length do not necessarily line up.

Thanks,
Nish

2005-05-04 18:16:40

by Darren Hart

[permalink] [raw]
Subject: Re: Help with the high res timers

George Anzinger wrote:
> |First I want to express my joy that we are discussing these issues at
> some depth.
>
> I also think we need a wider audience and so have added the kernel list
> to the cc. For that reason, I will not trim this message, thus allowing
> them to "catch up".
>
> Nishanth Aravamudan wrote:
>
>> On 03.05.2005 [16:15:04 -0700], George Anzinger wrote:
>>
>>> john stultz wrote:
>>>
>>>> On Tue, 2005-05-03 at 14:36 -0700, George Anzinger wrote:
>>>>
>>>>
>>>>> At the moment I am trying to understand and recast the work John
>>>>> Stultz is doing. I have some very real concerns here and there is
>>>>> a patch by [email protected]<Nishanth Aravamudan> that, to me, looks
>>>>> like a royal pain. I think we need to understand it better and
>>>>> voice our concerns in a way that is constructive. Nish's patch is
>>>>> here: http://www.ussg.iu.edu/hypermail/linux/kernel/0504.3/1635.html
>>
>>
>>
>> I'm not on the HRT list, so I didn't get the original e-mail...keep me
>> on the CC, please.
>>
>>
>>>> Hey George, Hopefully our work isn't causing you too much
>>>> trouble. I do understand
>>>> the pain of having large conflicting works that try to achieve similar
>>>> goals.
>>>>
>>>>
>>>>
>>>>> To summarize my concerns, the HRT code depends on correct and exact
>>>>> registration of the timer interrupt with time of day. (We don't
>>>>> worry about latency here.) In the latest x86 HRT patch we fine tune
>>>>> the system clock to insure this alignment. This allows us to set
>>>>> up two stage timers that use the run_timers code for the coarse
>>>>> time (to the time base resolution) followed by a short high res
>>>>> hardware timer to get to the higher resolution we want. If the
>>>>> coarse timer (i.e. add_timer, run_timer and friends) do not
>>>>> register correctly, we will have the case of a coarse timer
>>>>> expiring after the needed and requested time. Now,
>>>>
>>>>
>>>>> from a coarse timer point of view, that may be fine, but it means
>>>>> we can
>>>>
>>>>
>>>>> not depend on it for the high res first stage.
>>>>
>>>>
>>>>
>>>>
>>>> The conceptual idea Nish is going after, is once we have a solid
method
>>>> to get a high-res timeofday (via the new monotonic_clock()
>>>> function), we
>>>> can use that instead of jiffies to expire soft-timers.
>>>> This requires re-defining each timer-bucket entry length from a jiffy
>>>> instead to a unit of time (Nish calls this a timerinterval unit). This
>>>> timer-interval unit's length is flexible can be defined at compile
>>>> time.
>>>> Currently Nish is using ~1ms (1<<20 nanoseconds) as the interval
>>>> length,
>>>> but it could be redefined to be as desired to give higher-res soft-
>>>> timers.
>>>
>>>
>>> The issue is not its length, but can we interrupt at its end. The
>>> key to high res is being on time. If we want a 100usec resolution
>>> timer, we need to have the low res timer hit its mark with something
>>> real close to this.
>>
>>
>>
>> Are you asking: Can we interrupt at the end of the timerinterval? That
>> doesn't matter to me, so much, since I made the timerinerval length
>> independent of the interrupt source. If you wanted to make sure they
>> corresponded completely, you simply would need to change the
>> nsecs_to_timerintervals() function appropriately. I was trying to stay
>> somewhat efficient, especially since we're dealing with a 64-bit
>> quantity, and thus used the shifting John mentioned above.
>
>
> I think we are at too high a level to go into this in detail here, but
> part of the HRT patch is a header called sc_math.h (Scaled math). The
> notion that you need to choose and hard code the multiplier is addressed
> there. The only real choice you need to make is the number of bits you
> want to shift. The more you shift the more exact the answer is. For an
> example of choosing the shift bits you may want to look at jiffies.h in
> the current tree.
>
>>
>> So, currently, my code does not provide for high-res timers,
>> admittedly. It does enable some of the infrastructure that should make
>> implementing high-res timing in mainline (turned on by tuning
>> TIMERINTERVAL_BITS or setting a .config option, perhaps) more feasible,
>> in my eyes. I think we have the same goal in mind -- make timing
>> flexible and useful.
>>
>> To get 100 usec resolution with my system, you would need to make
>> TIMERINTERVAL_BITS=16. Basically, this makes a timerinterval = 1
>> nanosecond << 16, or 131.07 (131) microseconds. (Of course, you could
>> not use shifting and just divide by 10. Might be slower, but is
>> certainly feasible.) Admittedly, this is not perfect, but it's close.
>> Now, this has nothing to do with the hardware. But, it defines the
>> lowest interval, *regardless* of hardware, which the system will
>> support. If you request a 100 microsecond delay, we will make it 131
>> microseconds.
>
>
> Leaving aside the precision issue (addressed by scaled math above), I
> don't think this is the way to address high resolution. For the most
> part most user and kernel applications are quite happy with 10 ms
> resolution and, I would argue, we should go back to the 100 HZ value.
> IMNSHO high res should be asked for on a request at a time basis. The
> POSIX standard provided a rather nice way to do this in the CLOCKS &
> TIMERS section of the standard. In the kernel the low res part of this
> is in kernel/posix-timers.c. To extend to high res all one does is to
> define new clocks. In HRT we define two high res clocks
> (CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR). From the user point of
> view, to get high res, he has to used one of these clocks.
>
> The trick is to muster the kernel resources to make the *_HR clock a
> reality. In my HRT patches I have done this with a two phase approach.
> I simply add a second, HR member to the timer structure. This allows
> the timer to carry the extra bits needed to express the added resolution
> and also, if the bits are zero, to indicate that it is just a low res
> timer. Over the life time of the HRT patch I have worked with two
> different ways of handling the HR bits in the timer list. In all cases,
> we stay with a periodic time tick and only incur extra overhead when the
> timer is about to expire. We ask the "arch" code to provide an
> interrupt source with the needed short term resolution. Since we only
> use this timer at the end of the timer period it needs cover only a
> short time (0 to 1 jiffie), but should have good resolution over this
> period. Also, since it is a rather short time, we don't require it to
> have long term stability.
>
> The point is, the standard, low res, timer code needs to be only
> minimally aware of the HR bits. On the other hand, since the HR timer
> will be started on expiration of the low res timer, we need that
> expiration to be aligned with the clock in a stable, predictable long
> term way. (For what its worth, we have yet to address NTP clock
> drifting in the HRT patch, so that is still a fly in this web of time.)
>

You are still dependent on jiffies which is very unreliable du to lost
ticks, HZ vs ACTHZ, etc. Even if you use a periodic tick, it seems the
HRT code would still benefit from a human-time soft-timers sub-system
that wasn't affected by such things.


>>
>> Now, the trick is then, to get the hardware to tick every 100
>> microseconds. You have a lot more expertise in this area, and I would be
>> interested to see what the actual options are. Presuming we are able to
>> get a periodic interrupt at this rate, we could then do what I think HRT
>> pretty much does, maintain the normal buckets for normal timers and then
>> have a special HRT bucket. We could then be aware of when, in
>> nanoseconds, the next HRT interrupt should be, set the hardware
>> appropriately and be ok. If there are no HRT timers, then we'll never
>> check that bucket, really (we could also just not configure in that
>> option). I think it's feasible, regardless. There may be performance
>> penalties, I agree. I'm working on figuring out how bad things get if we
>> set TIMERINTERVAL_BITS=10, for instance (about every microsecond). If
>> the system stays up, I'll be happy :)
>
>
> I don't think we have hardware yet that can address 100usec ticks, nor,
> as I argue above, do we need it. In point of fact, we found it
> necessary to address the user who, using the POSIX standard timer
> interface asked for a repeating timer of, say 1 usec (yes we "say" we
> support 1 usec resolution). Such a request could easily consume the
> machine (I call these timer storms), causing it never to be hear from
> again. The, I think, elegant solution to the timer storm problem is to
> not restart the timer until the user picks up the prior expiration.
> This dynamically adjusts the timer response to the amount of machine
> available at the time.
>
>>
>>
>>>> Now when the new code expires timers it does so against the timeofday
>>>> subsystem's notion of time instead of jiffies. It simply goes through
>>>> all the timer bucket entries between now and the last time we expired
>>>> timers.
>>>
>>>
>>> This is not unlike what happens now. I would hope that the number of
>>> buckets visited averages to to something real close to 1 per
>>> run_timers entry.
>>
>>
>>
>> I agree, this is what we might be able to achieve by keeping high-res
>> timers in their own bucket. That way, we could conceivably know when the
>> next periodic interrupt is and when the next HRT interrupt is in
>> absolute nanoseconds (based on do_monotonic_clock()).
>
>
> For what its worth, we have such a clock in HRT. It is expressed in
> jiffies and arch_cycles, but jiffies is what is used today as the
> "standard" for internal system time. This gets into the area of just
> how to express time in areas where you need to quickly get and compare
> the time. After a lot of though, I decided to use arch_cycles rather
> than nanoseconds because arch_cycles are directly readable from the
> hardware. Any other unit of time requires a conversion, and, as you
> know, usually requires 64-bits as well. By staying with arch_cycles we
> avoid "most" of the conversion overhead.
>

I can see why you might want to use arch_cycles, and at least it is an
absolute time. Jiffies though is not. Perhaps HRT could benefit from a
conversion to "milliseconds + arch_cycles" ?

>>
>>
>>>> Now an interesting point you bring up above is when to schedule timer
>>>> interrupts. One could just have a fine-grained timer-interval unit and
>>>> crank up HZ, but clearly we don't want to schedule interrupts to
go off
>>>> too frequently or the overhead won't be worth it. Instead to get high-
>>>> res timers, the idea is to look at the timer list and schedule
>>>> interrupts appropriately. Then we only take an interrupt when there is
>>>> work to do, rather then at regular periodic intervals when there
may or
>>>> may not be anything to expire.
>>>
>>>
>>> This would suggest that all timers are high res. I don't think this
>>> makes sense:
>>> 1) because most users just don't care that much about resolution and
>>> high-res carries some overhead.
>>> 2) Not all platform hardware is able to handle high res.
>>
>>
>>
>> I agree, this is an issue which my TIMERINTERVAL_BITS=10 test should
>> also reveal. But under the normal case where TIMERINTERVAL_BITS are 20,
>> then there will never be an interrupt scheduled to go off quicker than
>> 2^20 nanoseconds in the future (about a millisecond). Everything hinges
>> around these bits :) (mostly because they define how we convert
>> nanoseconds to timerintervals).
>>
>>
>>> I think high res timers should only be used when the user asks for
>>> them. This keeps the overhead under control.
>>
>>
>>
>> I agree -- the user asking for the, currently, would mean they set
>> TIMERINTERVAL_BITS lower than normal and also set up the hardware to
>> interrupt appropriately.
>
>
> No, no, they ask by using a *_HR clock.
>
>>
>>
>>>> Now from my understanding of your framework, this sounds similar
to how
>>>> your high-res hardware timer interrupt is used. The idea being that we
>>>> use that for all soft-timers, instead of having two levels of high-res
>>>> and coarse soft-timers.
>>>>
>>>> The interesting problem that Darren will be looking at is how to
>>>> schedule interrupts accurately. This appears to be something you've
>>>> already attacked, so your experience would be helpful. My idea being
>>>> that the timer_interrupt code checks when it intended to fire and when
>>>> it actually did fire, and that information could then be used to tune
>>>> the interrupt scheduling code.
>>>
>>>
>>> A small, measurable latency is ok and need not be backed out by the
>>> software. If you go this route you risk expiring a timer early and
>>> the standard says bad things about this.
>>
>>
>>
>> I'm not sure what you mean about timers going off early. We round up on
>> addition and round down on expiration to make sure that we don't get
>> timers early. As long as do_monotonic_clock() is accurate, we should be
>> ok. (Darren's part uses the timerintervals value to figure out when the
>> next interrupt should be, similar to the existing
>> next_timer_interrupt())
>
>
> Your rounding up or down is not really different than what is done with
> jiffies today...
>
>>
>>
>>> What is missing in this is that the flag ship arch (x86) has piss
>>> poor capability to schedule timer interrupts. I.e. there really is
>>> no commonly available timer to generate interrupts at some arbitrary
>>> time. There is, however, the PIT, which, if you don't mess with it,
>>> is very low overhead but periodic.
>>
>>
>>
>> And generally, we are going to stay periodic, for now. Darren's patch is
>> a further extension on mine, which requires a lot of testing.
>>
>>
>>> Then there is the issue of what the time standard is. In the x86,
>>> the PIT is it. The pm_timer is close, but the read problem adds so
>>> much overhead as to make it hard to justify. Possibly the HPET does
>>> a better job, but it is still an I/O access (read SLOW) and is not
>>> commonly available as yet.
>>>
>>> You also have to do something reasonable for the accounting
>>> subsystems. Currently this is done by sampling the system at a
>>> periodic rate. What ever is running gets charged with the last
>>> period. If you go non-periodic, either you have to charge a variable
>>> period when the sample is taken or you have to set up timers as part
>>> of context switching. This ladder is not wise as the context switch
>>> overhead then gets larger. It is rather easy to show that the
>>> accounting overhead in such a system with a modest load is higher
>>> than in a periodic sampled system. This system is also charged with
>>> doing the time slicing. With out a periodic tick, a timer needs to
>>> be set up each context switch.
>>
>>
>>
>> We will always (at least in our current concept) have an idea of when,
>> worst/best case, the next timer interrupt will be (a maximum, predefined
>> interval). This would also define how long the system could go
>> completely idle without expiring any timers.
>
>
> I think we should leave the skipping of timer interrupts to VST code.
> The VST code only needs to figure out when the next timer is to expire
> when the cpu is idle, i.e. it has a time to spare to look up that timer.
>
>>
>>
>>> In addition to all of this, there is the issue of how to organize the
>>> timer list so that you can find the next timer. With the current
>>> organization this is something you just don't want to do. On the
>>> other hand, I am convinced that, for periodic timers, it is about as
>>> good as it gets.
>>
>>
>>
>> I think currently, we do something similar to next_timer_interrupt(),
>> making sure we update the stored value whenever we add a new timer, if
>> necessary. Except now, we will determine the next time to set the hw
>> timer to go off every time we expire timers.
>
>
> Finding the next timer is a high overhead operation....

Perhaps it is now, but does it have to be? The process scheduler used
to be full of O(N) algorithms to, but they were all converted to O(1) by
doing several incremental calculations instead of one huge one. Perhaps
something similar could be done with timers by cacheing a couple of
timer expirations and modifying them as timers are added, removed, and
modified. Just brainstorming here.

>
>>
>> I know my proposal is a big change and does make things more difficult
>> in the short-term for the HRT patch. But I think it is possible to work
>> together and get a very reasonable system in place, which allows for
>> both HRT and normal timers to work effectively. I may have made things
>> worse in my response, but I think the discussion has to be had :)
>>
>
>

--Darren

2005-05-04 18:17:44

by Chris Friesen

[permalink] [raw]
Subject: Re: Help with the high res timers

Nishanth Aravamudan wrote:

> If I understand your point correctly, I think this is achieved by
> TIMERINTERVAL_BITS in my patch (not to claim my patch is function, but
> conceptually). No matter what you actually request, the best you can do
> is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
> tick-rate and timerinterval length do not necessarily line up.

My point is simply that the timer for the next interval should start at
the time the timer expires, not the time that userspace picks up the
prior expiration. Throttling the timer rate should be done at the time
of timer request rather than timer expiry.

If I have usec-accuracy in the timer subsystem, I should be able to set
a timer with an interval of 9.999ms and have it remain accurate over
time (subject to scheduler jitter, of course). N timer intervals later
my timer should expire at (original_time + N*9.999ms + jitter). In this
case the error is roughly constant with time.

If the timer doesn't start counting the next interval until the user
detects expiry, I'm going to get some non-zero addition to *each*
interval such that my timers will not remain accurate over long periods
of time. In this case N timer intervals later my timer will expire at
(original_time + N*(9.999ms + jitter)) which is a very different thing.
Since jitter will always be positive, the error increases with time.

Chris

2005-05-04 20:36:36

by George Anzinger

[permalink] [raw]
Subject: Re: Help with the high res timers

Chris Friesen wrote:
> George Anzinger wrote:
>
>> The, I think, elegant solution to the timer storm problem is to not
>> restart the timer until the user picks up the prior expiration. This
>> dynamically adjusts the timer response to the amount of machine
>> available at the time.
>
>
> The disadvantage is that you then lose accuracy since each timer
> interval is increased by some random amount based on system scheduling.
> What about some kind of ulimit-type thing to specify the minimum
> recurring interval that can be specified? If root so specifies, you
> could have 1usec interval timers and the system would hang. This is
> conceptually no different than busy-looping in a SCHED_FIFO task.

The standard comes to the rescue here. The standard defines timer_getoverrun()
which returns the number of additional timeouts you _would_ have seen if you had
been fast enough.

I tried a limit thing. It is MUCH too fragile for the real world.
>
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-04 20:41:17

by George Anzinger

[permalink] [raw]
Subject: Re: Help with the high res timers

Chris Friesen wrote:
> Nishanth Aravamudan wrote:
>
>> If I understand your point correctly, I think this is achieved by
>> TIMERINTERVAL_BITS in my patch (not to claim my patch is function, but
>> conceptually). No matter what you actually request, the best you can do
>> is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
>> tick-rate and timerinterval length do not necessarily line up.
>
>
> My point is simply that the timer for the next interval should start at
> the time the timer expires, not the time that userspace picks up the
> prior expiration. Throttling the timer rate should be done at the time
> of timer request rather than timer expiry.
>
> If I have usec-accuracy in the timer subsystem, I should be able to set
> a timer with an interval of 9.999ms and have it remain accurate over
> time (subject to scheduler jitter, of course). N timer intervals later
> my timer should expire at (original_time + N*9.999ms + jitter). In this
> case the error is roughly constant with time.
>
> If the timer doesn't start counting the next interval until the user
> detects expiry, I'm going to get some non-zero addition to *each*
> interval such that my timers will not remain accurate over long periods
> of time. In this case N timer intervals later my timer will expire at
> (original_time + N*(9.999ms + jitter)) which is a very different thing.
> Since jitter will always be positive, the error increases with time.

You can, and we do in the main line kernel, keep this accuracy even with missed
timer expiries. When a timer is restarted, the actual requested expire time is
recalculated. It is adjusted ONLY by the requested repeat time until it results
in a time that has yet to come and the number of repeat times it takes is logged
in the overrun count for user examination.
>
> Chris
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-04 21:49:42

by George Anzinger

[permalink] [raw]
Subject: Re: Help with the high res timers

Darren Hart wrote:
> George Anzinger wrote:
>> |First I want to express my joy that we are discussing these issues at
>> some depth.
>>
~
>> The point is, the standard, low res, timer code needs to be only
>> minimally aware of the HR bits. On the other hand, since the HR timer
>> will be started on expiration of the low res timer, we need that
>> expiration to be aligned with the clock in a stable, predictable long
>> term way. (For what its worth, we have yet to address NTP clock
>> drifting in the HRT patch, so that is still a fly in this web of time.)
>>
>
> You are still dependent on jiffies which is very unreliable do to lost
> ticks, HZ vs ACTHZ, etc. Even if you use a periodic tick, it seems the
> HRT code would still benefit from a human-time soft-timers sub-system
> that wasn't affected by such things.

Well, first we clean up jiffies WRT lost ticks. We just don't have them with
the HRT patch. If you mean being late, just as John's code give the right time
all the time so does the HRT time code. The notion is that the TSC or the
pm_counter can carry us over a couple of lost ticks as well as the between tick
times. So, we aren't affected by lost ticks and ACTHZ just is not part of the
POSIX clocks & timers world.
>
>
>> >
~
>> For what its worth, we have such a clock in HRT. It is expressed in
>> jiffies and arch_cycles, but jiffies is what is used today as the
>> "standard" for internal system time. This gets into the area of just
>> how to express time in areas where you need to quickly get and compare
>> the time. After a lot of though, I decided to use arch_cycles rather
>> than nanoseconds because arch_cycles are directly readable from the
>> hardware. Any other unit of time requires a conversion, and, as you
>> know, usually requires 64-bits as well. By staying with arch_cycles we
>> avoid "most" of the conversion overhead.
>>
>
> I can see why you might want to use arch_cycles, and at least it is an
> absolute time. Jiffies though is not. Perhaps HRT could benefit from a
> conversion to "milliseconds + arch_cycles" ?

Two points here, first, jiffies, once you convince yourself that the PIT is the
gold time standard, IS the time. It is the best time we have in the x86 world
and in all other archs too. That is what the arch time code is supposed to do.
Your milliseconds is, or should be, derived from jiffies or, in other archs,
the same place jiffies is derived from, i.e. the time keeping "rock" on the
mother board. To account for things like NTP, we always grab arch_cycles,
xtime, jiffies, and wall_to_monotonic under a read lock at the same time. This
allows us to peg one expression of time against another.

All that said, for HRT to work correctly, the timer must have a high resolution
time contained in its structure. Currently this is jiffies and arch_cycles.
Further, we need to know that when that jiffies value arrives, we will get
called ASAP, and not at some random time between that value and the next time it
changes. For repeating timers we use these values _after_ the timer expires to
figure the next expire time, so if they are changed by the timer code, we are in
trouble.
>
>> >
>> >
~
>> > I think currently, we do something similar to next_timer_interrupt(),
>> > making sure we update the stored value whenever we add a new timer, if
>> > necessary. Except now, we will determine the next time to set the hw
>> > timer to go off every time we expire timers.
>>
>>
>> Finding the next timer is a high overhead operation....
>
> Perhaps it is now, but does it have to be? The process scheduler used
> to be full of O(N) algorithms to, but they were all converted to O(1) by
> doing several incremental calculations instead of one huge one. Perhaps
> something similar could be done with timers by caching a couple of
> timer expirations and modifying them as timers are added, removed, and
> modified. Just brainstorming here.

Believe me I did a lot of "brainstorming" around how to keep HR timers along
with the LR timers. Until just recently I had a different timer list in timer.c
which was hashed on the low timer bits. After much thought and finally
realizing that MOST timers do NOT expire, I came to conclude that the current
list is just right given what it is currently used for. The stuff that is not
O(1) in this structure is the cascading. The reason this is not a big issue is
just that most timers are removed prior to the cascade.

In the list structure I used to use, I had to order timers by expire time,
accounting for both jiffies and arch_cycles. Now HRT uses the same old cascade
structure and the timer code, for the most part, ignores the arch_cycles member.
Only on expire is it discovered and the timer sent to the hrtime code where
the timer is entered in to a very short list of timers that are to expire some
time between now and the next jiffie. This is a very simple O(N/2) list, but it
is also really short. (This is a simplified discussion. The code is on the HRT
web site and in the HRT CVS on that site for the curious.)
>
>>
>> >
~
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/