2004-06-05 07:12:43

by john stultz

[permalink] [raw]
Subject: Too much error in __const_udelay() ?

[Resending due to lkml bouncing me for having html attachments]

Hey all,
I've been hunting a bug on one of our systems and it seems to closely
resemble the __delay() issues we saw w/ the ACPI PM time source.

Earlier when implementing the ACPI PM time source, we found that we
couldn't use the actual ACPI PM time source for the __delay function, or
else various subsystems would break. I chalked it up to the 8Mhz counter
being too low res for some callers of delay(), causing the waits to be
far too long. We backed off to just using the TSC for delay() and things
seemed to get better.

However I've started to see some problems w/ 2.6 and USB on x440/x445s,
both of which use the 100Mhz cyclone time source. Further digging has
pointed to the fact that certain important udelay()s in the USB
subsystem aren't actually waiting long enough.

So far I've narrowed it down to the scaled math bits in
__const_udelay() causing too much error for loops_per_jiffy values
around the 100,000 level the cyclone timesource uses.

To demonstrate this, I wrote the attached demo app using the
__const_udelay code. For those not wanting to run it themselves, its
output (for HZ=1000) looks like:

1 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 100
1 usec: LPJ: 1500000 __udelay: 1000 vs my_udelay: 1500

2 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 200
2 usec: LPJ: 1500000 __udelay: 2000 vs my_udelay: 3000

5 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 500
5 usec: LPJ: 1500000 __udelay: 7000 vs my_udelay: 7500

10 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 1000
10 usec: LPJ: 1500000 __udelay: 14000 vs my_udelay: 15000

20 usec: LPJ: 100000 __udelay: 1000 vs my_udelay: 2000
20 usec: LPJ: 1500000 __udelay: 29000 vs my_udelay: 30000

50 usec: LPJ: 100000 __udelay: 4000 vs my_udelay: 5000
50 usec: LPJ: 1500000 __udelay: 74000 vs my_udelay: 75000

100 usec: LPJ: 100000 __udelay: 9000 vs my_udelay: 10000
100 usec: LPJ: 1500000 __udelay: 149000 vs my_udelay: 150000



Here you can see __udelay() fails to be even close to accurate until
~50usec and returns zero for values less then 20usec.

I then went and measured the same udelay() values via rdtsc in the
kernel for both the cyclone based delay() as well as the tsc based
delay. The results are attached in the html file. You'll notice this
closely matches the results from the demo app.

This issue hasn't bitten me before w/ 2.4 because (as you can show w/
the demo app) __const_udelay() is more accurate w/ HZ=100.

I tried replacing __const_udelay w/ my_delay() but it didn't boot
(overflow issues, I'm guessing).

So I'm no math wiz. What's the proper fix here?

thanks
-john


Attachments:
test.c (1.26 kB)
delay-differences.html.gz (1.31 kB)
Download all attachments

2004-06-05 15:23:38

by Dominik Brodowski

[permalink] [raw]
Subject: Re: Too much error in __const_udelay() ?

Hi,

> However I've started to see some problems w/ 2.6 and USB on x440/x445s,
> both of which use the 100Mhz cyclone time source. Further digging has
> pointed to the fact that certain important udelay()s in the USB
> subsystem aren't actually waiting long enough.

Certain? AFAICS _no_ call to a delay routine actually passed a big enough
argument. Or am I missing something? Also, __ndelay seems to be affected
as well: it returns zero for 550 nsec even for the TSC variant in your
test.c.

> So I'm no math wiz. What's the proper fix here?

Below are three changes I'd like to discuss. I'll build a fresh kernel with
all three changes enabled + PM_TIMER soon.

Change 1:

Move the multiplication with HZ up into the mull instruction:

unsigned long __const_udelay(unsigned long xloops)
{
int d0;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
:"1" (xloops),"0" (LPJ * HZ));
return __delay(xloops);
}

1 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 99
1 usec: LPJ: 1500000 __udelay: 1000 vs my_udelay: 1499

2 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 199
2 usec: LPJ: 1500000 __udelay: 2000 vs my_udelay: 2999

5 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 499
5 usec: LPJ: 1500000 __udelay: 7000 vs my_udelay: 7498

10 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 999
10 usec: LPJ: 1500000 __udelay: 14000 vs my_udelay: 14996

20 usec: LPJ: 100000 __udelay: 1000 vs my_udelay: 1999
20 usec: LPJ: 1500000 __udelay: 29000 vs my_udelay: 29993

50 usec: LPJ: 100000 __udelay: 4000 vs my_udelay: 4998
50 usec: LPJ: 1500000 __udelay: 74000 vs my_udelay: 74983

100 usec: LPJ: 100000 __udelay: 9000 vs my_udelay: 9997
100 usec: LPJ: 1500000 __udelay: 149000 vs my_udelay: 149966

20000 usec: LPJ: 100000 __udelay: 1999000 vs my_udelay: 1999549
20000 usec: LPJ: 1500000 __udelay: 29993000 vs my_udelay: 29993243


Change 2:

Round up in __udelay. While it can be argued that some time is also
spent in the delay functions, it's better to spend _at least_ the specified
time sleeping, in my humble opinion.


- return __const_udelay2(usecs * 0x000010c6); /* 2**32 / 1000000 */
+ return __const_udelay2(usecs * 0x000010c7); /* 2**32 / 1000000 (rounded up)*/

1 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 100
1 usec: LPJ: 1500000 __udelay: 1000 vs my_udelay: 1500

2 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 200
2 usec: LPJ: 1500000 __udelay: 2000 vs my_udelay: 3000

5 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 500
5 usec: LPJ: 1500000 __udelay: 7000 vs my_udelay: 7500

10 usec: LPJ: 100000 __udelay: 0 vs my_udelay: 1000
10 usec: LPJ: 1500000 __udelay: 14000 vs my_udelay: 15000

20 usec: LPJ: 100000 __udelay: 1000 vs my_udelay: 2000
20 usec: LPJ: 1500000 __udelay: 29000 vs my_udelay: 30000

50 usec: LPJ: 100000 __udelay: 4000 vs my_udelay: 5000
50 usec: LPJ: 1500000 __udelay: 74000 vs my_udelay: 75000

100 usec: LPJ: 100000 __udelay: 9000 vs my_udelay: 10000
100 usec: LPJ: 1500000 __udelay: 149000 vs my_udelay: 150001

20000 usec: LPJ: 100000 __udelay: 1999000 vs my_udelay: 2000015
20000 usec: LPJ: 1500000 __udelay: 29993000 vs my_udelay: 30000228


Change 3:

Asserting at least 1 loop is spent: in really small ndelay() calls to
low-mhz timers, this might be better.

return __delay(xloops ? xloops : 1);

Before:
1 nsec: LPJ: 100000 __ndelay: 0 vs my_udelay: 0
2 nsec: LPJ: 100000 __ndelay: 0 vs my_udelay: 0
5 nsec: LPJ: 100000 __ndelay: 0 vs my_udelay: 0
10 nsec: LPJ: 100000 __ndelay: 0 vs my_udelay: 1
20 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 2
50 nsec: LPJ: 100000 __ndelay: 0 vs my_udelay: 5

After:
1 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 1
2 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 1
5 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 1
10 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 1
20 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 2
50 nsec: LPJ: 100000 __udelay: 0 vs my_udelay: 5



Dominik

2004-06-06 19:46:47

by Pavel Machek

[permalink] [raw]
Subject: Re: Too much error in __const_udelay() ?

Hi!

> Change 3:
>
> Asserting at least 1 loop is spent: in really small ndelay() calls to
> low-mhz timers, this might be better.
>
> return __delay(xloops ? xloops : 1);

Should not you always round up? If user asks you to delay 1900 usec,
delaying 1000 usec is a bug. If you do this, make-it-one-when-its-zero
hack should be unneccessary.
Pavel

--
934a471f20d6580d5aad759bf0d97ddc

2004-06-07 19:13:10

by john stultz

[permalink] [raw]
Subject: Re: Too much error in __const_udelay() ?

On Sat, 2004-06-05 at 08:23, Dominik Brodowski wrote:
> Hi,
>
> > However I've started to see some problems w/ 2.6 and USB on x440/x445s,
> > both of which use the 100Mhz cyclone time source. Further digging has
> > pointed to the fact that certain important udelay()s in the USB
> > subsystem aren't actually waiting long enough.
>
> Certain? AFAICS _no_ call to a delay routine actually passed a big enough
> argument. Or am I missing something? Also, __ndelay seems to be affected
> as well: it returns zero for 550 nsec even for the TSC variant in your
> test.c.

Indeed its likely.

> > So I'm no math wiz. What's the proper fix here?
>
> Below are three changes I'd like to discuss. I'll build a fresh kernel with
> all three changes enabled + PM_TIMER soon.

Ah, your test output is a bit confusing (changes to __const_udealy
affect the output of my_udelay), but I think I understand it. Forgive me
if I miss-interpret.

> Change 1:
>
> Move the multiplication with HZ up into the mull instruction:
>
> unsigned long __const_udelay(unsigned long xloops)
> {
> int d0;
> __asm__("mull %0"
> :"=d" (xloops), "=&a" (d0)
> :"1" (xloops),"0" (LPJ * HZ));
> return __delay(xloops);
> }

This does make a good bit of difference! Good catch!


> Change 2:
>
> Round up in __udelay. While it can be argued that some time is also
> spent in the delay functions, it's better to spend _at least_ the specified
> time sleeping, in my humble opinion.
>
>
> - return __const_udelay2(usecs * 0x000010c6); /* 2**32 / 1000000 */
> + return __const_udelay2(usecs * 0x000010c7); /* 2**32 / 1000000 (rounded up)*/
>

This change looks right to me.


> Change 3:
>
> Asserting at least 1 loop is spent: in really small ndelay() calls to
> low-mhz timers, this might be better.
>
> return __delay(xloops ? xloops : 1);

I agree w/ Pavel that rounding up sounds better, but I can't get the
math to work, so this may be the best solution.

I'm also spinning up a patch w/ these changes to test, let me know how
your testing went and I'll do the same.
-john


2004-06-07 20:27:19

by john stultz

[permalink] [raw]
Subject: Re: Too much error in __const_udelay() ?

On Mon, 2004-06-07 at 12:12, john stultz wrote:
> I'm also spinning up a patch w/ these changes to test, let me know how
> your testing went and I'll do the same.

The following patch of your suggestions resolves the USB problem for me.
I'd be interested if anyone else could test this to insure it doesn't
open up any other issues.

thanks
-john

===== arch/i386/lib/delay.c 1.5 vs edited =====
--- 1.5/arch/i386/lib/delay.c Wed Jul 2 21:21:32 2003
+++ edited/arch/i386/lib/delay.c Mon Jun 7 12:17:26 2004
@@ -33,13 +33,13 @@
int d0;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
- :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy));
- __delay(xloops * HZ);
+ :"1" (xloops),"0" (HZ*current_cpu_data.loops_per_jiffy));
+ __delay(xloops?xloops:1);
}

void __udelay(unsigned long usecs)
{
- __const_udelay(usecs * 0x000010c6); /* 2**32 / 1000000 */
+ __const_udelay(usecs * 0x000010c7); /* 2**32 / 1000000 */
}

void __ndelay(unsigned long nsecs)
===== include/asm-i386/delay.h 1.2 vs edited =====
--- 1.2/include/asm-i386/delay.h Tue Feb 18 06:40:31 2003
+++ edited/include/asm-i386/delay.h Mon Jun 7 12:16:42 2004
@@ -16,7 +16,7 @@
extern void __delay(unsigned long loops);

#define udelay(n) (__builtin_constant_p(n) ? \
- ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
+ ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c7ul)) : \
__udelay(n))

#define ndelay(n) (__builtin_constant_p(n) ? \





2004-06-07 21:22:39

by Dominik Brodowski

[permalink] [raw]
Subject: [PATCH 3/3] fix for small xloops [Was: Re: Too much error in __const_udelay() ?]

The const_udelay calculation relies on the "overflow" of the lower 32 bits
of the mull operation. What's in the lower 32 bits is "cut off", so that a
"rounding down" phenomenon exists. For large arguments to {n,u}delay, this does
not matter, as udelay and ndelay round _up_ themselves. However, for small
delays (for cyclone timer: up to 20ns; for pmtmr-based delay timer it's even
up to 1500ns or 1us) it _is_ a critical error. Empirical testing has shown that
it happens only (for usual values of loops_per_jiffies) if xloops is lower or
equal to six. Let's be safe, and double that value, and add one xloop if
xloop is smaller than 13.

Signed-off-by: Dominik Brodowski <[email protected]>

diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
--- linux-original/arch/i386/lib/delay.c 2004-06-07 23:02:02.472656160 +0200
+++ linux/arch/i386/lib/delay.c 2004-06-07 22:55:40.063791144 +0200
@@ -34,6 +34,8 @@
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
:"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * HZ));
+ if (unlikely(xloops < 13))
+ xloops++;
__delay(xloops);
}

2004-06-07 21:22:34

by Dominik Brodowski

[permalink] [raw]
Subject: [PATCH 2/3] round up in __udelay() [Was: Re: Too much error in __const_udelay() ?]

Round up in __udelay(): 2**32 / 100000 is 4294.97, so it's more intuitive
to round up, and it causes more predictable results:

n usec delay on a 1500000 BogoMIPS system:

n before -mull after
1 1000 ticks 1499 ticks 1500 ticks
10 14000 ticks 14999 ticks 15000 ticks

n usec delay on a 100000 BogoMIPS system:

n before -mull after
1 0 ticks 99 ticks 100 ticks
10 0 ticks 999 ticks 1000 ticks
100 9000 ticks 9999 ticks 10000 ticks


While it can be argued that some time is also spent in the delay
functions, it's better to spend _at least_ the specified time sleeping,
in my humble opinion. And the overhead of a specific ->delay() implementation
should be substracted in the specific ->delay() implementation.


Signed-off-by: Dominik Brodowski <[email protected]>


diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
--- linux-original/arch/i386/lib/delay.c 2004-06-07 22:10:42.053950984 +0200
+++ linux/arch/i386/lib/delay.c 2004-06-07 22:12:58.789164072 +0200
@@ -39,7 +39,7 @@

void __udelay(unsigned long usecs)
{
- __const_udelay(usecs * 0x000010c6); /* 2**32 / 1000000 */
+ __const_udelay(usecs * 0x000010c7); /* 2**32 / 1000000 (rounded up) */
}

void __ndelay(unsigned long nsecs)
diff -ruN linux-original/include/asm-i386/delay.h linux/include/asm-i386/delay.h
--- linux-original/include/asm-i386/delay.h 2004-06-07 22:01:48.901002552 +0200
+++ linux/include/asm-i386/delay.h 2004-06-07 22:14:00.393798744 +0200
@@ -16,7 +16,7 @@
extern void __delay(unsigned long loops);

#define udelay(n) (__builtin_constant_p(n) ? \
- ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
+ ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c7ul)) : \
__udelay(n))

#define ndelay(n) (__builtin_constant_p(n) ? \

2004-06-07 21:24:12

by Dominik Brodowski

[permalink] [raw]
Subject: [PATCH 1/3] mull'ify multiplication with HZ in __const_udelay() [Was: Re: Too much error in __const_udelay() ?]

Move the multiplication of (loops_per_jiffy * xloops) with HZ into
the "mull" asm operation. This increases the accuracy of the delay functions
largely:

n usec delay on a 1500000 BogoMIPS system:

n before after
1 1000 ticks 1499 ticks
10 14000 ticks 14999 ticks

n usec delay on a 100000 BogoMIPS system:

n before after
1 0 ticks 99 ticks
10 0 ticks 999 ticks
100 9000 ticks 9999 ticks


Signed-off-by: Dominik Brodowski <[email protected]>


diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
--- linux-original/arch/i386/lib/delay.c 2004-06-07 22:01:46.608351088 +0200
+++ linux/arch/i386/lib/delay.c 2004-06-07 22:05:03.299449496 +0200
@@ -33,8 +33,8 @@
int d0;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
- :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy));
- __delay(xloops * HZ);
+ :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * HZ));
+ __delay(xloops);
}

void __udelay(unsigned long usecs)

2004-06-07 21:24:34

by Dominik Brodowski

[permalink] [raw]
Subject: Re: Too much error in __const_udelay() ?

On Mon, Jun 07, 2004 at 12:12:48PM -0700, john stultz wrote:
> I agree w/ Pavel that rounding up sounds better, but I can't get the
> math to work, so this may be the best solution.

It's some strange sort of rounding, see my patch "3"...

> I'm also spinning up a patch w/ these changes to test, let me know how
> your testing went and I'll do the same.

Testing went fine -- even for the PMTMR-based delay case [*].

Dominik

[*] though I noticed the cpufreq notifier breaks then: it updates
loops_per_jiffy without evaluating if it's indeed TSC- or even
frequency-based. It'll fail on cyclone, too, I think...

2004-06-07 22:00:26

by john stultz

[permalink] [raw]
Subject: Re: [PATCH 1/3] mull'ify multiplication with HZ in __const_udelay() [Was: Re: Too much error in __const_udelay() ?]

On Mon, 2004-06-07 at 14:20, Dominik Brodowski wrote:
> Move the multiplication of (loops_per_jiffy * xloops) with HZ into
> the "mull" asm operation. This increases the accuracy of the delay functions
> largely:
>
[snip]
> diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
> --- linux-original/arch/i386/lib/delay.c 2004-06-07 22:01:46.608351088 +0200
> +++ linux/arch/i386/lib/delay.c 2004-06-07 22:05:03.299449496 +0200
> @@ -33,8 +33,8 @@
> int d0;
> __asm__("mull %0"
> :"=d" (xloops), "=&a" (d0)
> - :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy));
> - __delay(xloops * HZ);
> + :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * HZ));
> + __delay(xloops);
> }

Kurt Garloff brought up a good point that loops_per_jiffy*HZ is only
good up to 4Ghz time sources. The workaround he suggested was to
multiply xloops by 4 first and divide HZ by 4. This will allow for
frequencies up to 16Ghz.

So something like:

xloops *= 4;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
:"1" (xloops),"0" (LPJ*(HZ/4)));

-john

2004-06-09 10:04:09

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 3/3] fix for small xloops [Was: Re: Too much error in __const_udelay() ?]

Hi!

> The const_udelay calculation relies on the "overflow" of the lower 32 bits
> of the mull operation. What's in the lower 32 bits is "cut off", so that a
> "rounding down" phenomenon exists. For large arguments to {n,u}delay, this does
> not matter, as udelay and ndelay round _up_ themselves. However, for small
> delays (for cyclone timer: up to 20ns; for pmtmr-based delay timer it's even
> up to 1500ns or 1us) it _is_ a critical error. Empirical testing has shown that
> it happens only (for usual values of loops_per_jiffies) if xloops is lower or
> equal to six. Let's be safe, and double that value, and add one xloop if
> xloop is smaller than 13.

Should not you just xloops++, always? Better safe than sorry. Plus you
have one less test and branch...

Pavel

> Signed-off-by: Dominik Brodowski <[email protected]>
>
> diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
> --- linux-original/arch/i386/lib/delay.c 2004-06-07 23:02:02.472656160 +0200
> +++ linux/arch/i386/lib/delay.c 2004-06-07 22:55:40.063791144 +0200
> @@ -34,6 +34,8 @@
> __asm__("mull %0"
> :"=d" (xloops), "=&a" (d0)
> :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * HZ));
> + if (unlikely(xloops < 13))
> + xloops++;
> __delay(xloops);
> }
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
934a471f20d6580d5aad759bf0d97ddc

2004-06-15 06:13:18

by Dominik Brodowski

[permalink] [raw]
Subject: [PATCH 1/3] mull'ify multiplication with HZ in __const_udelay() [Was: Re: Too much error in __const_udelay() ?]

John Stultz mentioned on lkml ( http://lkml.org/lkml/2004/6/5/15 ) that
calls to udelay() don't delay long enough, causing trouble e.g. in the USB
subsystem. The following patches address this issue.

Move the multiplication of (loops_per_jiffy * xloops) with HZ into
the "mull" asm operation. This increases the accuracy of the delay functions
largely:

n usec delay on a system with loops_per_jiffy = 1500000 :

n before after
1 1000 ticks 1499 ticks
10 14000 ticks 14999 ticks

n usec delay on a system with loops_per_jiffy = 100000 :

n before after
1 0 ticks 99 ticks
10 0 ticks 999 ticks
100 9000 ticks 9999 ticks

As noted by Kurt Garloff, it's necessary to adjust for large loops_per_jiffies,
as the multiplication of it with HZ fails for 4GHz or larger. So, John Stultz
suggested multiplying xloops with 4 first, and multiplying with (HZ/4).

Signed-off-by: Dominik Brodowski <[email protected]>


diff -ruN linux-original/arch/i386/lib/delay.c linux/arch/i386/lib/delay.c
--- linux-original/arch/i386/lib/delay.c 2004-06-14 18:20:27.000000000 +0200
+++ linux/arch/i386/lib/delay.c 2004-06-15 07:48:57.302279400 +0200
@@ -31,10 +31,11 @@
inline void __const_udelay(unsigned long xloops)
{
int d0;
+ xloops *= 4;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
- :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy));
- __delay(xloops * HZ);
+ :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * (HZ/4)));
+ __delay(xloops);
}

void __udelay(unsigned long usecs)