2013-08-28 18:30:05

by Mischa Jonker

[permalink] [raw]
Subject: [PATCH] ARC: Fix __udelay parentheses

Make sure that usecs is casted to long long, to ensure that the
(usecs * 4295 * HZ) multiplication is 64 bit.

Initially, the (usecs * 4295 * HZ) part was done as a 32 bit
multiplication, with the result casted to 64 bit. This led to some bits
falling off.

Signed-off-by: Mischa Jonker <[email protected]>
---
arch/arc/include/asm/delay.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arc/include/asm/delay.h b/arch/arc/include/asm/delay.h
index 442ce5d..8d35fe1 100644
--- a/arch/arc/include/asm/delay.h
+++ b/arch/arc/include/asm/delay.h
@@ -56,8 +56,8 @@ static inline void __udelay(unsigned long usecs)
/* (long long) cast ensures 64 bit MPY - real or emulated
* HZ * 4295 is pre-evaluated by gcc - hence only 2 mpy ops
*/
- loops = ((long long)(usecs * 4295 * HZ) *
- (long long)(loops_per_jiffy)) >> 32;
+ loops = (((long long) usecs) * 4295 * HZ *
+ (long long) loops_per_jiffy) >> 32;

__delay(loops);
}
--
1.7.9.5


2013-08-28 18:45:41

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On Wed, 2013-08-28 at 20:29 +0200, Mischa Jonker wrote:
> Make sure that usecs is casted to long long, to ensure that the
> (usecs * 4295 * HZ) multiplication is 64 bit.
>
> Initially, the (usecs * 4295 * HZ) part was done as a 32 bit
> multiplication, with the result casted to 64 bit. This led to some bits
> falling off.
>
> Signed-off-by: Mischa Jonker <[email protected]>
> ---
> arch/arc/include/asm/delay.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arc/include/asm/delay.h b/arch/arc/include/asm/delay.h
> index 442ce5d..8d35fe1 100644
> --- a/arch/arc/include/asm/delay.h
> +++ b/arch/arc/include/asm/delay.h
> @@ -56,8 +56,8 @@ static inline void __udelay(unsigned long usecs)
> /* (long long) cast ensures 64 bit MPY - real or emulated
> * HZ * 4295 is pre-evaluated by gcc - hence only 2 mpy ops
> */
> - loops = ((long long)(usecs * 4295 * HZ) *
> - (long long)(loops_per_jiffy)) >> 32;
> + loops = (((long long) usecs) * 4295 * HZ *
> + (long long) loops_per_jiffy) >> 32;

Shouldn't this be unsigned long long or u64?

Why is it >> 32 again?

The comment above it doesn't seem to match the code.

2013-08-28 18:55:52

by Mischa Jonker

[permalink] [raw]
Subject: RE: [PATCH] ARC: Fix __udelay parentheses

> > Make sure that usecs is casted to long long, to ensure that the (usecs
> > * 4295 * HZ) multiplication is 64 bit.
> >
> > Initially, the (usecs * 4295 * HZ) part was done as a 32 bit
> > multiplication, with the result casted to 64 bit. This led to some
> > bits falling off.
> >
> > Signed-off-by: Mischa Jonker <[email protected]>
> > ---
> > arch/arc/include/asm/delay.h | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arc/include/asm/delay.h
> > b/arch/arc/include/asm/delay.h index 442ce5d..8d35fe1 100644
> > --- a/arch/arc/include/asm/delay.h
> > +++ b/arch/arc/include/asm/delay.h
> > @@ -56,8 +56,8 @@ static inline void __udelay(unsigned long usecs)
> > /* (long long) cast ensures 64 bit MPY - real or emulated
> > * HZ * 4295 is pre-evaluated by gcc - hence only 2 mpy ops
> > */
> > - loops = ((long long)(usecs * 4295 * HZ) *
> > - (long long)(loops_per_jiffy)) >> 32;
> > + loops = (((long long) usecs) * 4295 * HZ *
> > + (long long) loops_per_jiffy) >> 32;
>
> Shouldn't this be unsigned long long or u64?
Yes, it should, but that is not directly related to the issue:)

> Why is it >> 32 again?
>
> The comment above it doesn't seem to match the code.
>
The original code explains about the >> 32:

/*
* Normal Math for computing loops in "N" usecs
* -we have precomputed @loops_per_jiffy
* -1 sec has HZ jiffies
* loops per "N" usecs = ((loops_per_jiffy * HZ / 1000000) * N)
*
* Approximate Division by multiplication:
* -Mathematically if we multiply and divide a number by same value the
* result remains unchanged: In this case, we use 2^32
* -> (loops_per_N_usec * 2^32 ) / 2^32
* -> (((loops_per_jiffy * HZ / 1000000) * N) * 2^32) / 2^32
* -> (loops_per_jiffy * HZ * N * 4295) / 2^32
*
* -Divide by 2^32 is very simply right shift by 32
* -We simply need to ensure that the multiply per above eqn happens in
* 64-bit precision (if CPU doesn't support it - gcc can emaulate it)
*/

The problem is that the original code already _tried_ to cast to 64 bits (to ensure a 64 bit multiply), but only did so after the multiply (so 32 bit multiply).

Mischa

2013-08-28 19:01:48

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On Wed, 2013-08-28 at 18:53 +0000, Mischa Jonker wrote:
> > > Make sure that usecs is casted to long long, to ensure that the (usecs
> > > * 4295 * HZ) multiplication is 64 bit.
> > >
> > > Initially, the (usecs * 4295 * HZ) part was done as a 32 bit
> > > multiplication, with the result casted to 64 bit. This led to some
> > > bits falling off.
> > >
> > > Signed-off-by: Mischa Jonker <[email protected]>
> > > ---
> > > arch/arc/include/asm/delay.h | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arc/include/asm/delay.h
> > > b/arch/arc/include/asm/delay.h index 442ce5d..8d35fe1 100644
> > > --- a/arch/arc/include/asm/delay.h
> > > +++ b/arch/arc/include/asm/delay.h
> > > @@ -56,8 +56,8 @@ static inline void __udelay(unsigned long usecs)
> > > /* (long long) cast ensures 64 bit MPY - real or emulated
> > > * HZ * 4295 is pre-evaluated by gcc - hence only 2 mpy ops
> > > */
> > > - loops = ((long long)(usecs * 4295 * HZ) *
> > > - (long long)(loops_per_jiffy)) >> 32;
> > > + loops = (((long long) usecs) * 4295 * HZ *
> > > + (long long) loops_per_jiffy) >> 32;
> >
> > Shouldn't this be unsigned long long or u64?
> Yes, it should, but that is not directly related to the issue:)
>
> > Why is it >> 32 again?
> >
> > The comment above it doesn't seem to match the code.
> >
> The original code explains about the >> 32:
>
> /*
> * Normal Math for computing loops in "N" usecs
> * -we have precomputed @loops_per_jiffy
> * -1 sec has HZ jiffies
> * loops per "N" usecs = ((loops_per_jiffy * HZ / 1000000) * N)
> *
> * Approximate Division by multiplication:
> * -Mathematically if we multiply and divide a number by same value the
> * result remains unchanged: In this case, we use 2^32
> * -> (loops_per_N_usec * 2^32 ) / 2^32
> * -> (((loops_per_jiffy * HZ / 1000000) * N) * 2^32) / 2^32
> * -> (loops_per_jiffy * HZ * N * 4295) / 2^32
> *
> * -Divide by 2^32 is very simply right shift by 32
> * -We simply need to ensure that the multiply per above eqn happens in
> * 64-bit precision (if CPU doesn't support it - gcc can emaulate it)
> */

I don't see the loops_per_jiffy initial shift << 32.

> The problem is that the original code already _tried_ to cast to 64 bits (to ensure a 64 bit multiply), but only did so after the multiply (so 32 bit multiply).

I know that. It's the use of a signed long long
vs the unsigned long long that I think wrong.

Why cast a unsigned to a signed?

2013-08-28 19:12:28

by Mischa Jonker

[permalink] [raw]
Subject: RE: [PATCH] ARC: Fix __udelay parentheses

Hello Joe,

> I don't see the loops_per_jiffy initial shift << 32.

loops_per_jiffy * HZ = loops_per_second
loops_per_jiffy * HZ = 1,000,000 * loops_per_us
loops_per_jiffy * HZ * 4295 = 4,295,000 * loops_per_us

loops_per_jiffy * HZ * 4294.967296 = 2^32 * loops_per_us

> > > > - loops = ((long long)(usecs * 4295 * HZ) *
> > > > - (long long)(loops_per_jiffy)) >> 32;
> > > > + loops = (((long long) usecs) * 4295 * HZ *
> > > > + (long long) loops_per_jiffy) >> 32;

>
> I know that. It's the use of a signed long long vs the unsigned long long
> that I think wrong.

Yes that is wrong too.

>
> Why cast a unsigned to a signed?

I don't know, this was in the original file. The issue that I was trying to solve, was that usleep didn't sleep long enough, and that is fixed by this patch.

Wrt signed/unsigned: would you like me to update this patch or create a separate one?

Mischa

2013-08-29 01:43:53

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On Wed, 2013-08-28 at 19:12 +0000, Mischa Jonker wrote:
> Hello Joe,
>
> > I don't see the loops_per_jiffy initial shift << 32.
>
> loops_per_jiffy * HZ = loops_per_second
> loops_per_jiffy * HZ = 1,000,000 * loops_per_us
> loops_per_jiffy * HZ * 4295 = 4,295,000 * loops_per_us
>
> loops_per_jiffy * HZ * 4294.967296 = 2^32 * loops_per_us
>
> > > > > - loops = ((long long)(usecs * 4295 * HZ) *
> > > > > - (long long)(loops_per_jiffy)) >> 32;
> > > > > + loops = (((long long) usecs) * 4295 * HZ *
> > > > > + (long long) loops_per_jiffy) >> 32;
>
> >
> > I know that. It's the use of a signed long long vs the unsigned long long
> > that I think wrong.
>
> Yes that is wrong too.
>
> >
> > Why cast a unsigned to a signed?
>
> I don't know, this was in the original file. The issue that I was trying to solve, was that usleep didn't sleep long enough, and that is fixed by this patch.
>
> Wrt signed/unsigned: would you like me to update this patch or create a separate one?

I think the whole thing is odd and it should simply be

loops = loops_per_jiffy * usecs_to_jiffies(usecs);

and if it's really necessary to have a u64 delay
then __delay should be rewritten to take an one
as an argument and this calc should be:

u64 loops = (u64)loops_per_jiffy * usecs_to_jiffies(usecs);


2013-08-29 05:56:08

by Vineet Gupta

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On 08/29/2013 12:00 AM, Mischa Jonker wrote:
> Make sure that usecs is casted to long long, to ensure that the
> (usecs * 4295 * HZ) multiplication is 64 bit.
>
> Initially, the (usecs * 4295 * HZ) part was done as a 32 bit
> multiplication, with the result casted to 64 bit. This led to some bits
> falling off.
>
> Signed-off-by: Mischa Jonker <[email protected]>
> ---
> arch/arc/include/asm/delay.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arc/include/asm/delay.h b/arch/arc/include/asm/delay.h
> index 442ce5d..8d35fe1 100644
> --- a/arch/arc/include/asm/delay.h
> +++ b/arch/arc/include/asm/delay.h
> @@ -56,8 +56,8 @@ static inline void __udelay(unsigned long usecs)
> /* (long long) cast ensures 64 bit MPY - real or emulated
> * HZ * 4295 is pre-evaluated by gcc - hence only 2 mpy ops
> */
> - loops = ((long long)(usecs * 4295 * HZ) *
> - (long long)(loops_per_jiffy)) >> 32;
> + loops = (((long long) usecs) * 4295 * HZ *
> + (long long) loops_per_jiffy) >> 32;
>
> __delay(loops);
> }

The intent of writing orig code was to generate only 1 MPYHU insn (32*32 =
high-part-64) for the whole math, at any optimization level whatsoever. If the
first MPY is overflowing, u r likely spinning for > 10,000 usec (10ms) which is 1
scheduling tick on ARC - not good - presumably for hardware debug. It would be
better to use a tight loop there and throw it out later.

The API abuse would only be caught for const @usecs case. Maybe we need to add a
WARN_ON() there.

OTOH, if we really want to fix this, it would be cleaner to rewrite this as
loops = ((u64)usecs * 4295 * HZ * loops_per_jiffy) >> 32;

Since one factor is upcasted, all are promoted to 64 bit. And we leave the
optimizations to whims of gcc.

@Joern, I would assume that long long vs u64 (or unsigned long long) doesn't
matter in this particular case.

-Vineet

2013-08-29 06:01:19

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On Thu, 2013-08-29 at 05:55 +0000, Vineet Gupta wrote:

> The intent of writing orig code was to generate only 1 MPYHU insn (32*32 =
> high-part-64) for the whole math, at any optimization level whatsoever. If the
> first MPY is overflowing, u r likely spinning for > 10,000 usec (10ms) which is 1
> scheduling tick on ARC - not good - presumably for hardware debug. It would be
> better to use a tight loop there and throw it out later.

It's a delay loop. Does it matter whether
or not a multiply or division is used?

2013-08-29 06:24:29

by Vineet Gupta

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On 08/29/2013 11:31 AM, Joe Perches wrote:
> On Thu, 2013-08-29 at 05:55 +0000, Vineet Gupta wrote:
>
>> The intent of writing orig code was to generate only 1 MPYHU insn (32*32 =
>> high-part-64) for the whole math, at any optimization level whatsoever. If the
>> first MPY is overflowing, u r likely spinning for > 10,000 usec (10ms) which is 1
>> scheduling tick on ARC - not good - presumably for hardware debug. It would be
>> better to use a tight loop there and throw it out later.
> It's a delay loop. Does it matter whether
> or not a multiply or division is used?

I know what you mean here. Your suggestion from a different mail,

> I think the whole thing is odd and it should simply be
>
> loops = loops_per_jiffy * usecs_to_jiffies(usecs)


This adds an additional MPYHU (ignoring the large limms and check for max jiffies).
FWIW, most arches do optimize this routine a bit - so ARC not using a standard
kernel API is not that big a sin ;-)

On the topic of multiply vs. divide (which probably is not relevant to topic at
hand though), since ARCompact doesn't have native divide, we end up emulating it
using libgcc routines. That makes it slightly non-deterministic (not a big deal)
and also adds to boot time (which those delays sprinkled all over the place in
crazy device probes and such). Seriously we got hammered by a customer for that once.

-Vineet

2013-08-29 06:37:08

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On Thu, 2013-08-29 at 06:24 +0000, Vineet Gupta wrote:
> On 08/29/2013 11:31 AM, Joe Perches wrote:
> > I think the whole thing is odd and it should simply be
> >
> > loops = loops_per_jiffy * usecs_to_jiffies(usecs)
[]
> On the topic of multiply vs. divide (which probably is not relevant to topic at
> hand though), since ARCompact doesn't have native divide, we end up emulating it
> using libgcc routines. That makes it slightly non-deterministic (not a big deal)
> and also adds to boot time (which those delays sprinkled all over the place in
> crazy device probes and such). Seriously we got hammered by a customer for that once.

That argues more for reducing the uses of hard delays
than making the hard delay count calculation simpler.

2013-08-29 06:49:13

by Vineet Gupta

[permalink] [raw]
Subject: Re: [PATCH] ARC: Fix __udelay parentheses

On 08/29/2013 12:07 PM, Joe Perches wrote:
>> On the topic of multiply vs. divide (which probably is not relevant to topic at
>> hand though), since ARCompact doesn't have native divide, we end up emulating it
>> using libgcc routines. That makes it slightly non-deterministic (not a big deal)
>> and also adds to boot time (which those delays sprinkled all over the place in
>> crazy device probes and such). Seriously we got hammered by a customer for that once.
> That argues more for reducing the uses of hard delays
> than making the hard delay count calculation simpler.

Right, but most often the drivers are written/maintained by non arch people.

I'm with you in terms of simplification and I think the following

loops = ((u64)usecs * 4295 * HZ * loops_per_jiffy) >> 32;

is fairly simple, well commented, and optimal for ARC.

-Vineet