2004-03-20 00:40:25

by Roland Dreier

Subject: Fast 64-bit atomic writes (SSE?)

Hi, I'm trying to find the best (fastest) way to write a 64-bit
quantity atomically to a device (on i386). Taking a spinlock around
two 32-bit accesses seems to be rather expensive. I'm mostly
concerned about newish CPUs, so I'm willing to use SSE or SSE2
instructions (of course falling back to a slower locked path if the
kernel is built for an old CPU).

I guess I have two questions, a general one and a specific one.

General question:
What's the best way to do this?

Specific question:
I've come up with the below function. However, you may notice that
I'm forced to use movdqu (instead of movdqa, which is presumably
faster). This is because even with the __attribute__((aligned(16))),
my xmmsave array is not guaranteed to be aligned to 16 bytes. I could
just allocate 31 bytes for xmmsave and align it to 16 bytes by hand,
but that seems a little ugly. Is there some magic I'm missing?

Thanks,
Roland

static inline void mywrite64(u32 *val, void *dest)
{
        u8 xmmsave[16] __attribute__((aligned(16)));

        /* The #ifdefs are a hack to deal with 2.4 kernels without preempt. */
#ifdef CONFIG_PREEMPT
        preempt_disable();
#endif

        /* We use movdqu for the moment, because even
           __attribute__((aligned(16))) doesn't seem to guarantee
           xmmsave is aligned to a 16-byte boundary. */
        __asm__ __volatile__ (
                "movdqu %%xmm0,(%0)\n\t"
                "movq   (%1),%%xmm0\n\t"
                "movq   %%xmm0,(%2)\n\t"
                "movdqu (%0),%%xmm0\n\t"
                :
                : "r" (xmmsave), "r" (val), "r" (dest)
                : "memory");

#ifdef CONFIG_PREEMPT
        preempt_enable();
#endif
}


2004-03-20 07:37:50

by Andi Kleen

Subject: Re: Fast 64-bit atomic writes (SSE?)

Roland Dreier <[email protected]> writes:
>
> General question:
> What's the best way to do this?

Definitely not how you do it ;-) You corrupt the user space FPU context.
Also you didn't do a CPUID check, so it would just crash on machines
without SSE2.

The RAID code has some examples on how to use SSE2 in the kernel correctly.

Better is probably to use CMPXCHG8B, which avoids all of this.

-Andi

2004-03-20 14:44:26

by Roland Dreier

Subject: Re: Fast 64-bit atomic writes (SSE?)

Andi> Definitely not how you do it ;-) You corrupt the user space
Andi> FPU context. Also you didn't do a CPUID check, so it would
Andi> just crash on machines without SSE2.

I'm not an asm expert, so could you explain how it corrupts the FPU
context? I tried to save off the value of the XMM register I used,
and the docs I have say that the movq and movdqu instructions don't
affect any flags.

As for the CPUID check, you're right... I left that part of the code
out, but I am definitely planning on using this only if the machine
has SSE2.

Andi> The RAID code has some examples on how to use SSE2 in the
Andi> kernel correctly.

Hmm, they save cr0 and do a clts, and then restore cr0 when they're
done. For my education, can you explain why?

Andi> Better is probably to use CMPXCHG8B, which avoids all of
Andi> this.

OK, thanks.

- Roland

2004-03-20 14:49:22

by Roland Dreier

Subject: Re: Fast 64-bit atomic writes (SSE?)

Andi> Better is probably to use CMPXCHG8B, which avoids all of
Andi> this.

Sorry to follow up again so soon, but I just looked up CMPXCHG8B, and I
don't see how to use it to write to write-only device memory. Can you
elaborate a little?

Thanks,
Roland