Sending this to a wider audience:
The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
and the platform has unsynchronized ITCs and the calling task migrates
to another CPU while doing the udelay loop, then the effective delay may
be too short or very, very long.
The simplest fix is to disable preemption around the udelay looping.
The downside is that this inhibits realtime preemption for cases of long
udelays. One datapoint: an SGI realtime engineer reports that if
CONFIG_PREEMPT is turned off, no significant holdoffs are attributed
to udelay().
I am reluctant to propose a much more complicated patch (that disables
preemption only for "short" delays, and uses the global RTC as the time
base for longer, preemptible delays) unless this patch introduces
significant and unacceptable preemption delays.
Signed-off-by: John Hawkes <[email protected]>
Index: linux/include/asm-ia64/delay.h
===================================================================
--- linux.orig/include/asm-ia64/delay.h 2005-10-27 17:02:08.000000000 -0700
+++ linux/include/asm-ia64/delay.h 2005-12-14 10:30:55.000000000 -0800
@@ -87,11 +87,17 @@
static __inline__ void
udelay (unsigned long usecs)
{
- unsigned long start = ia64_get_itc();
- unsigned long cycles = usecs*local_cpu_data->cyc_per_usec;
+ unsigned long start;
+ unsigned long cycles;
+
+ preempt_disable();
+ cycles = usecs*local_cpu_data->cyc_per_usec;
+ start = ia64_get_itc();
while (ia64_get_itc() - start < cycles)
cpu_relax();
+
+ preempt_enable();
}
#endif /* _ASM_IA64_DELAY_H */
On Wed, Dec 14, 2005 at 03:25:26PM -0800, [email protected] wrote:
> Sending this to a wider audience:
>
> The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
> and the platform has unsynchronized ITCs and the calling task migrates
> to another CPU while doing the udelay loop, then the effective delay may
> be too short or very, very long.
>
> The simplest fix is to disable preemption around the udelay looping.
> The downside is that this inhibits realtime preemption for cases of long
> udelays. One datapoint: an SGI realtime engineer reports that if
> CONFIG_PREEMPT is turned off, no significant holdoffs are attributed
> to udelay().
>
> I am reluctant to propose a much more complicated patch (that disables
> preemption only for "short" delays, and uses the global RTC as the time
> base for longer, preemptible delays) unless this patch introduces
> significant and unacceptable preemption delays.
Stuck between a rock and the proverbial hard place.
I think that the more complex patch is needed though. If some crazy
driver has a pre-emptible udelay(10000), then you really don't want
to spin for that long without allowing preemption.
What is the threshold between a "long" and a "short" delay? Presumably
related to whatever your maximum hold-off value is. Also depends on
how long it takes to read the global RTC.
If you don't want to get the global RTC involved, then perhaps
you can break long delays up inside udelay()? Something like
this:
#define SMALLUSECS 250	/* randomly picked, needs some thought! */

static __inline__ void
udelay (unsigned long usecs)
{
	unsigned long start;
	unsigned long cycles;
	unsigned long smallusecs;

	while (usecs > 0) {
		smallusecs = (usecs > SMALLUSECS) ? SMALLUSECS : usecs;
		preempt_disable();
		cycles = smallusecs*local_cpu_data->cyc_per_usec;
		start = ia64_get_itc();
		while (ia64_get_itc() - start < cycles)
			cpu_relax();
		preempt_enable();
		usecs -= smallusecs;
	}
}
This does make the function a bit big for an "inline" though. Does
it really need to be inline? Do we care how fast our delay loops
are?
-Tony
On Thu, Dec 15, 2005 at 02:50:40PM -0800, Luck, Tony wrote:
> This does make the function a bit big for an "inline" though. Does
> it really need to be inline? Do we care how fast our delay loops
> are?
Moving the current slim-line udelay() out of line would save 41 Kbytes
of text in the generic vmlinux, plus making any modules that use
udelay smaller too. Savings run from 128-160 bytes for drivers
with just one call to a max of 9 Kbytes for qla2xxx.ko.
Being out-of-line would reduce accuracy, but this would only be
significant when the sleep is for a very small number of microseconds.
So if we need to add more code to udelay(), I think that it
should be moved out-of-line too (into arch/ia64/kernel/time.c).
alpha, m68knommu, powerpc and sh64 already have out of line udelay().
-Tony
On Thu, 15 Dec 2005, Luck, Tony wrote:
> On Wed, Dec 14, 2005 at 03:25:26PM -0800, [email protected] wrote:
> > Sending this to a wider audience:
> >
> > The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
> > and the platform has unsynchronized ITCs and the calling task migrates
> > to another CPU while doing the udelay loop, then the effective delay may
> > be too short or very, very long.
> >
> > The simplest fix is to disable preemption around the udelay looping.
> > The downside is that this inhibits realtime preemption for cases of long
> > udelays. One datapoint: an SGI realtime engineer reports that if
> > CONFIG_PREEMPT is turned off, no significant holdoffs are attributed
> > to udelay().
> >
> > I am reluctant to propose a much more complicated patch (that disables
> > preemption only for "short" delays, and uses the global RTC as the time
> > base for longer, preemptible delays) unless this patch introduces
> > significant and unacceptable preemption delays.
>
> Stuck between a rock and the proverbial hard place.
>
> I think that the more complex patch is needed though. If some crazy
> driver has a pre-emptible udelay(10000), then you really don't want
> to spin for that long without allowing preemption.
If it's a preemptible sleep period it should just use msleep.
On Thu, 2005-12-15 at 17:52 -0800, Zwane Mwaikambo wrote:
> On Thu, 15 Dec 2005, Luck, Tony wrote:
>
> > On Wed, Dec 14, 2005 at 03:25:26PM -0800, [email protected] wrote:
> > > Sending this to a wider audience:
> > >
> > > The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
> > > and the platform has unsynchronized ITCs and the calling task migrates
> > > to another CPU while doing the udelay loop, then the effective delay may
> > > be too short or very, very long.
> > >
> > > The simplest fix is to disable preemption around the udelay looping.
> > > The downside is that this inhibits realtime preemption for cases of long
> > > udelays. One datapoint: an SGI realtime engineer reports that if
> > > CONFIG_PREEMPT is turned off, no significant holdoffs are attributed
> > > to udelay().
> > >
> > > I am reluctant to propose a much more complicated patch (that disables
> > > preemption only for "short" delays, and uses the global RTC as the time
> > > base for longer, preemptible delays) unless this patch introduces
> > > significant and unacceptable preemption delays.
> >
> > Stuck between a rock and the proverbial hard place.
> >
> > I think that the more complex patch is needed though. If some crazy
> > driver has a pre-emptible udelay(10000), then you really don't want
> > to spin for that long without allowing preemption.
>
> If it's a preemptible sleep period it should just use msleep.
There are 10 drivers that udelay(10000) or more and a TON that
udelay(1000). Turning those all into 1ms+ non preemptible sections will
be very bad.
Lee
From: "Lee Revell" <[email protected]>
> There are 10 drivers that udelay(10000) or more and a TON that
> udelay(1000). Turning those all into 1ms+ non preemptible sections will
> be very bad.
What about 100usec non-preemptible sections?
John Hawkes
On Thu, 15 Dec 2005, Lee Revell wrote:
> On Thu, 2005-12-15 at 17:52 -0800, Zwane Mwaikambo wrote:
> >
> > If it's a preemptible sleep period it should just use msleep.
>
> There are 10 drivers that udelay(10000) or more and a TON that
> udelay(1000). Turning those all into 1ms+ non preemptible sections will
> be very bad.
What I said was _msleep_: if the driver is sleeping for 10ms and it can
handle being preempted, you can switch it over to _msleep_. The original
post was regarding making udelay non-preemptible.
On Thu, 15 Dec 2005, John Hawkes wrote:
> From: "Lee Revell" <[email protected]>
> > There are 10 drivers that udelay(10000) or more and a TON that
> > udelay(1000). Turning those all into 1ms+ non preemptible sections will
> > be very bad.
>
> What about 100usec non-preemptible sections?
> John Hawkes
100usec should be ok as a non-preemptible udelay. Is this 1 specific
driver?
Resubmitting using Tony Luck's suggested alternative.
The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
and the platform has unsynchronized ITCs and the calling task migrates
to another CPU while doing the udelay loop, then the effective delay may
be too short or very, very long.
This patch disables preemption around 100 usec chunks of the overall
desired udelay time. This minimizes preemption-holdoffs.
Signed-off-by: John Hawkes <[email protected]>
Index: linux/include/asm-ia64/delay.h
===================================================================
--- linux.orig/include/asm-ia64/delay.h 2005-12-15 15:34:16.000000000 -0800
+++ linux/include/asm-ia64/delay.h 2005-12-15 15:34:57.000000000 -0800
@@ -84,14 +84,6 @@
ia64_delay_loop (loops - 1);
}
-static __inline__ void
-udelay (unsigned long usecs)
-{
- unsigned long start = ia64_get_itc();
- unsigned long cycles = usecs*local_cpu_data->cyc_per_usec;
-
- while (ia64_get_itc() - start < cycles)
- cpu_relax();
-}
+extern void udelay (unsigned long usecs);
#endif /* _ASM_IA64_DELAY_H */
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c 2005-12-15 15:34:16.000000000 -0800
+++ linux/arch/ia64/kernel/time.c 2005-12-15 15:34:57.000000000 -0800
@@ -249,3 +249,31 @@
*/
set_normalized_timespec(&wall_to_monotonic, -xtime.tv_sec, -xtime.tv_nsec);
}
+
+#define SMALLUSECS 100
+
+void
+udelay (unsigned long usecs)
+{
+ unsigned long start;
+ unsigned long cycles;
+ unsigned long smallusecs;
+
+ /*
+ * Execute the non-preemptible delay loop (because the ITC might
+ * not be synchronized between CPUS) in relatively short time
+ * chunks, allowing preemption between the chunks.
+ */
+ while (usecs > 0) {
+ smallusecs = (usecs > SMALLUSECS) ? SMALLUSECS : usecs;
+ preempt_disable();
+ cycles = smallusecs*local_cpu_data->cyc_per_usec;
+ start = ia64_get_itc();
+
+ while (ia64_get_itc() - start < cycles)
+ cpu_relax();
+
+ preempt_enable();
+ usecs -= smallusecs;
+ }
+}
On Thu, 2005-12-15 at 18:12 -0800, John Hawkes wrote:
> From: "Lee Revell" <[email protected]>
> > There are 10 drivers that udelay(10000) or more and a TON that
> > udelay(1000). Turning those all into 1ms+ non preemptible sections will
> > be very bad.
>
> What about 100usec non-preemptible sections?
That will disappear into the noise; in normal usage these happen all the
time. 500usec non-preemptible regions are rare (~1 hour to show up) and
1ms very rare (24 hours). My tests show that 300 usec or so is a good
place to draw the line if you don't want it to show up in latency tests.
Lee
Luck, Tony schrieb:
>Moving the current slim-line udelay() out of line would save 41 Kbytes
>of text in the generic vmlinux, plus making any modules that use
>udelay smaller too. Savings run from 128-160 bytes for drivers
>with just one call to a max of 9 Kbytes for qla2xxx.ko.
>
Tony,
we should really take this chance to make the kernel smaller, since it
gets bigger every day all by itself. Furthermore, I'd guess that for
current and future IA64 implementations we might even win performance by
having functions out of line, since we have a really fast call mechanism
here. Of course any difference depends heavily on cache utilization, so
some benchmarking would be needed. At least for udelay the answer is easy
and must be "do it out of line".
Did anyone already run a benchmark for comparison of inline vs. out of
line for IA64? Or does anybody want to do it?
Christian
On Thu, Dec 15, 2005 at 06:42:52PM -0800, [email protected] wrote:
> +
> +#define SMALLUSECS 100
John, I did not see your posts until this had already made it out.
I would think that the folks running realtime applications would expect
udelay to hold off for even shorter periods of time. I would expect
something along the lines of 20 or 25 usec.
> +
> +void
> +udelay (unsigned long usecs)
> +{
> + unsigned long start;
> + unsigned long cycles;
> + unsigned long smallusecs;
> +
> + /*
> + * Execute the non-preemptible delay loop (because the ITC might
> + * not be synchronized between CPUS) in relatively short time
> + * chunks, allowing preemption between the chunks.
> + */
> + while (usecs > 0) {
> + smallusecs = (usecs > SMALLUSECS) ? SMALLUSECS : usecs;
> + preempt_disable();
> + cycles = smallusecs*local_cpu_data->cyc_per_usec;
> + start = ia64_get_itc();
> +
> + while (ia64_get_itc() - start < cycles)
> + cpu_relax();
> +
> + preempt_enable();
> + usecs -= smallusecs;
> + }
> +}
How much drift would you expect from this? I have not tried this, but
what about something more along the lines of:
#define MAX_USECS_WHILE_NOT_PREMPTIBLE 20

void
udelay (unsigned long usecs)
{
	unsigned long next, timeout;
	long last_processor = -1;

	/*
	 * Execute the non-preemptible delay loop (because the ITC might
	 * not be synchronized between CPUS) in relatively short time
	 * chunks, allowing preemption between the chunks.
	 */
	while (usecs > 0) {
		next = min(usecs, MAX_USECS_WHILE_NOT_PREMPTIBLE);
		preempt_disable();
		if (last_processor != smp_processor_id()) {
			last_processor = smp_processor_id();
			timeout = ia64_get_itc();
		}
		timeout += next * local_cpu_data->cyc_per_usec;
		while (ia64_get_itc() < timeout)
			cpu_relax();
		preempt_enable();
		usecs -= next;
	}
}
On Thu, 2005-12-15 at 17:04 -0800, Luck, Tony wrote:
> Being out-of-line would reduce accuracy, but this would only be
> significant when the sleep is for a very small number of microseconds.
Even then it's a predicted branch to code that will probably be in cache,
so it really ought to have no material impact.
On Fri, Dec 16, 2005 at 06:28:54AM -0600, Robin Holt wrote:
> > +#define SMALLUSECS 100
>
> John, I did not see your posts until this had already made it out.
> I would think that the folks running realtime applications would expect
> udelay to hold off for even shorter periods of time. I would expect
> something along the line of 20 or 25 uSec.
A good question ... I'm going to put John's change in as-is for now so
that 2.6.15 can benefit from the reduced code size of the out-of-line
version and avoid the ugly bug when preemption is enabled on a system
with drifting ITCs. We can fine-tune the udelay() implementation after
we get some data on what is needed.
> How much drift would you expect from this? I have not tried this, but
> what about something more along the lines of:
>
> #define MAX_USECS_WHILE_NOT_PREMPTIBLE 20
As we reduce the non-preemptible window, drift in my version of udelay()
would get worse ... but I haven't done any measurements on how much worse.
> timeout += next * local_cpu_data->cyc_per_usec;
> while (ia64_get_itc() < timeout)
> cpu_relax();
Bad news if your ar.itc wraps around (less than four centuries of uptime
at 1.6GHz :-)
-Tony
From: "Luck, Tony" <[email protected]>
...
> A good question ... I'm going to put John's change in as-is for now so
> that 2.6.15 can benefit from the reduced code size of the out-of-line
> and avoid the ugly bug when preemption is enabled on a drifty system.
Okay, but I'll propose some changes to that soon, probably using Robin Holt's
suggestions. I'm concerned about the impact of interrupts outside of the
inner loop and inside the outer loop, which would increase the effective delay
time.
John Hawkes
Lee Revell wrote:
> On Thu, 2005-12-15 at 18:12 -0800, John Hawkes wrote:
>
>>From: "Lee Revell" <[email protected]>
>>
>>>There are 10 drivers that udelay(10000) or more and a TON that
>>>udelay(1000). Turning those all into 1ms+ non preemptible sections will
>>>be very bad.
>>
>>What about 100usec non-preemptible sections?
>
>
> That will disappear into the noise, in normal usage these happen all the
> time. 500usec non preemptible regions are rare (~1 hour to show up) and
> 1ms very rare (24 hours). My tests show that 300 usec or so is a good
> place to draw the line if you don't want it to show up in latency tests.
I may be misreading the original post, but the problem is described as
one where the TSC is not synchronised and a CPU switch takes place. Would
the correct solution be to somehow set CPU affinity temporarily in such
a way as to avoid disabling preempt at all?
The preempt doesn't seem to be the root problem, so it's unlikely to be
the best solution...
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
On Thu, 22 Dec 2005 16:45:08 -0500,
Bill Davidsen <[email protected]> wrote:
>Lee Revell wrote:
>> On Thu, 2005-12-15 at 18:12 -0800, John Hawkes wrote:
>>
>>>From: "Lee Revell" <[email protected]>
>>>
>>>>There are 10 drivers that udelay(10000) or more and a TON that
>>>>udelay(1000). Turning those all into 1ms+ non preemptible sections will
>>>>be very bad.
>>>
>>>What about 100usec non-preemptible sections?
>>
>>
>> That will disappear into the noise, in normal usage these happen all the
>> time. 500usec non preemptible regions are rare (~1 hour to show up) and
>> 1ms very rare (24 hours). My tests show that 300 usec or so is a good
>> place to draw the line if you don't want it to show up in latency tests.
>
>I may be misreading the original post, but the problem is described as
>one where the TSC is not synchronised and a CPU switch takes place. Would
>the correct solution be to somehow set CPU affinity temporarily in such
>a way as to avoid disabling preempt at all?
>
>The preempt doesn't seem to be the root problem, so it's unlikely to be
>the best solution...
Agreed. See [RFC] Add thread_info flag for "no cpu migration"[1] on
lkml. It got no response.
[1] http://marc.theaimsgroup.com/?l=linux-kernel&m=113471059115107&w=2
On Fri, 23 Dec 2005, Keith Owens wrote:
> Agreed. See [RFC] Add thread_info flag for "no cpu migration"[1] on
> lkml. It got no response.
>
> [1] http://marc.theaimsgroup.com/?l=linux-kernel&m=113471059115107&w=2
I read it, and it would also make fixing cpuhotplug + cpufreq easier.
However, you could also argue that the whole locking in cpufreq should be
fixed instead.