2009-07-06 19:20:53

by John Gregor

[permalink] [raw]
Subject: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

From: John Gregor <[email protected]>

Some processors can write the same word more than once if the movs
instruction is used. This version uses normal memory move instructions
and gets the same performance since the speed is limited by PCIe and
write combining.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: John Gregor <[email protected]>
---
arch/x86/lib/iomap_copy_64.S | 20 ++++++++++++++++++--
1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/iomap_copy_64.S b/arch/x86/lib/iomap_copy_64.S
index 05a95e7..344b00e 100644
--- a/arch/x86/lib/iomap_copy_64.S
+++ b/arch/x86/lib/iomap_copy_64.S
@@ -1,4 +1,5 @@
/*
+ * Copyright 2009 QLogic Corporation. All rights reserved.
* Copyright 2006 PathScale, Inc. All Rights Reserved.
*
* This file is free software; you can redistribute it and/or modify
@@ -23,8 +24,23 @@
*/
ENTRY(__iowrite32_copy)
CFI_STARTPROC
- movl %edx,%ecx
- rep movsd
+ movl %edx, %ecx
+ andl $-2, %edx
+ je .L2
+ leaq (%rsi,%rdx,4), %rdx
+.L1:
+ movq (%rsi), %rax
+ addq $8, %rsi
+ movq %rax, (%rdi)
+ addq $8, %rdi
+ cmpq %rsi, %rdx
+ ja .L1
+.L2:
+ bt $0, %ecx
+ jae .L4
+ movl (%rsi), %eax
+ movl %eax, (%rdi)
+.L4:
ret
CFI_ENDPROC
ENDPROC(__iowrite32_copy)
--
1.6.0.6


2009-07-06 21:30:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

John Gregor wrote:
> From: John Gregor <[email protected]>
>
> Some processors can write the same word more than once if the movs
> instruction is used. This version uses normal memory move instructions
> and gets the same performance since the speed is limited by PCIe and
> write combining.
>

Which processors do that *when addressing I/O memory*?

-hpa

2009-07-06 21:37:01

by John Gregor

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

"H. Peter Anvin" <[email protected]> wrote:

> John Gregor wrote:
> > From: John Gregor <[email protected]>
> >
> > Some processors can write the same word more than once if the movs
> > instruction is used. This version uses normal memory move instructions
> > and gets the same performance since the speed is limited by PCIe and
> > write combining.
> >
>
> Which processors do that *when addressing I/O memory*?

Nehalem.

-John Gregor

2009-07-06 21:41:53

by John Gregor

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

"H. Peter Anvin" <[email protected]> wrote:

> John Gregor wrote:
> > From: John Gregor <[email protected]>
> >
> > Some processors can write the same word more than once if the movs
> > instruction is used. This version uses normal memory move instructions
> > and gets the same performance since the speed is limited by PCIe and
> > write combining.
> >
>
> Which processors do that *when addressing I/O memory*?

More specifically, check out:

http://www.intel.com/Assets/PDF/specupdate/321324.pdf

Errata AAK6

-John Gregor

2009-07-06 22:47:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

John A. Gregor wrote:
> "H. Peter Anvin" <[email protected]> wrote:
>
>> John Gregor wrote:
>>> From: John Gregor <[email protected]>
>>>
>>> Some processors can write the same word more than once if the movs
>>> instruction is used. This version uses normal memory move instructions
>>> and gets the same performance since the speed is limited by PCIe and
>>> write combining.
>>>
>> Which processors do that *when addressing I/O memory*?
>
> Nehalem.
>

Thanks. That should go into the patch header with a reference to the
erratum, rather than saying "some processors". Additionally, there
should be a comment in the code to that effect. Otherwise when 5-10
years from now we're trying to figure out what is going on, there is
absolutely no hope, and we'll either re-trigger the bug or end up being
stuck with cargo-cult programming.

It's probably also worth nothing that with this change, there probably
is little point in writing this in assembly at all. This is especially
so since your implementation is buggy: by definition, __iowrite32_copy()
does the I/O operations 32 bits at a time, but you are using 64-bit
operations.

This gets even more ironic when you look at the erratum, which states
that this happens only when crossing from WB/WC memory into UC/WP/WT
memory, and in the case of UC memory (the only one of those types which
is used in Linux):

* UC the data size of each write will now always be 8 bytes, as
opposed to the original data size.

A workaround is also described, which is to avoid page-crossing copies.
No mention of doubled stores or anything like that.

Note the irony in that your patch actually takes the problem described
in the erratum and implementing it in software -- with your change,
*all* stores would be 8 bytes, which of course violates the definition
entirely.

Note how critical detailed information of the erratum was to being able
to make an actual analysis of the patch, too. Again, this is not just a
case immediately, but possibly down the line.

Hence:

- your patch is wrong and actually introduces the problem it is supposed
to solve;
- your description of the problem was both incomplete and incorrect (it
might be correct if there is another erratum, but if so please
describe it, it is not Nehalem AAK6);
- the current code is fine as long as it doesn't cross from cached to
uncached memory, which it never should in any kind of sensible driver;
- if you are worried about that, introduce code which breaks the string
copy at page boundaries.

Thanks,

-hpa

2009-07-06 23:44:10

by John Gregor

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

"H. Peter Anvin" <[email protected]> wrote:
> Thanks. That should go into the patch header with a reference to the
> erratum, rather than saying "some processors". Additionally, there
> should be a comment in the code to that effect. Otherwise when 5-10
> years from now we're trying to figure out what is going on, there is
> absolutely no hope, and we'll either re-trigger the bug or end up being
> stuck with cargo-cult programming.

I agree, there can be better info in the patch submission. I'll take
a 2nd stab at writing it up.

> It's probably also worth nothing that with this change, there probably
> is little point in writing this in assembly at all. This is especially
> so since your implementation is buggy: by definition, __iowrite32_copy()
> does the I/O operations 32 bits at a time, but you are using 64-bit
> operations.

The point of the routine is to make sure that transfers are *at least*
32-bit aligned and *at least* multiples of 32-bits in length. 64-bit
transfers meet those conditions.

Another, less clearly stated, condition (which is imposed by the adapter
the code was written for(*)) is that we can't take multiple writes
to the same memory location (i.e. the number of dwords written to the
adapter's buffer must == the number of dwords in the transfer). It is
this condition that rep mov doesn't always respect. It appears that
(possibly due to the processor taking an interrupt) the rep mov can
"back up" and redo part of the transfer. We've gotten PCIe analyzer
traces that confirm this behavior.

(*) Now you might wonder why such an adapter-specific routine is living in
arch/x86/lib. And I might agree. But years ago, when we submitted ipath
for inclusion into the kernel, it was felt by some that the arch-specific
.S down in the driver was the wrong place and that it should be made
into a generally-available routine with an optimized x86_64 variant.
So here we are.

> Note the irony in that your patch actually takes the problem described
> in the erratum and implementing it in software -- with your change,
> *all* stores would be 8 bytes, which of course violates the definition
> entirely.

See above, 8-byte alignment and transfers (being a subset of 4-byte
alignments) are fine. Since we are writing into a WC space, we don't
have control over how many dwords get stuck together in a transfer
anyway.

> Note how critical detailed information of the erratum was to being able
> to make an actual analysis of the patch, too. Again, this is not just a
> case immediately, but possibly down the line.

> - your description of the problem was both incomplete and incorrect (it
> might be correct if there is another erratum, but if so please
> describe it, it is not Nehalem AAK6);

You are right, AAK6 was what got us to investigate the rep mov sequences
when we started having problems, but it is not an exact match for what
we encountered. As far as I know, there is no errata published directly
detailing the conditions where we found the problem. We do have PCIe
traces confirming that, when doing write-combining PIO writes using
rep mov, that some dword locations were being written to multiple times.
This appears to be associated with taking interrupts and not with crossing
any page boundaries.

-John Gregor

2009-07-07 00:03:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

John A. Gregor wrote:
>
>> It's probably also worth nothing that with this change, there probably
>> is little point in writing this in assembly at all. This is especially
>> so since your implementation is buggy: by definition, __iowrite32_copy()
>> does the I/O operations 32 bits at a time, but you are using 64-bit
>> operations.
>
> The point of the routine is to make sure that transfers are *at least*
> 32-bit aligned and *at least* multiples of 32-bits in length. 64-bit
> transfers meet those conditions.

No. That may be what *you* want, but that's not what the function is
documented or specified to do.

/**
* __iowrite32_copy - copy data to MMIO space, in 32-bit units
* @to: destination, in MMIO space (must be 32-bit aligned)
* @from: source (must be 32-bit aligned)
* @count: number of 32-bit quantities to copy
*
* Copy data from kernel space to MMIO space, in units of 32 bits at a
* time. Order of access is not guaranteed, nor is a memory barrier
* performed afterwards.
*/

That is the generic version and the documented ABI, which the x86
version must mimic. This is not the function to use if you don't care
about access size, which you better not if you're accessing WC memory.

> Another, less clearly stated, condition (which is imposed by the adapter
> the code was written for(*)) is that we can't take multiple writes
> to the same memory location (i.e. the number of dwords written to the
> adapter's buffer must == the number of dwords in the transfer). It is
> this condition that rep mov doesn't always respect. It appears that
> (possibly due to the processor taking an interrupt) the rep mov can
> "back up" and redo part of the transfer. We've gotten PCIe analyzer
> traces that confirm this behavior.
[...]
> See above, 8-byte alignment and transfers (being a subset of 4-byte
> alignments) are fine. Since we are writing into a WC space, we don't
> have control over how many dwords get stuck together in a transfer
> anyway.

If you're doing this to WC memory, I believe your driver (or hardware)
is broken. WC does not guarantee this condition to the best of my
knowledge, although I haven't found anything that explicitly states that
WC memory has to be idempotent (it is documented, though, that it can
collapse stores to the same address, so it might have been considered a
resulting issue.)

Anyway, I'm fine simply removing the x86 assembly version of this
routine and simply using the generic routine. The checkin should
reference the observed behavior (which I would also suggest escalating
via your Intel FAEs if you can; I can try to investigate internally as
well.) If you want 64-bit transfers, you should use __iowrite64_copy();
if you want completely private behavior you should use your own routine
(and just write it in C) rather than changing the documented behavior of
an existing routine from one that might be useful for other drivers
(although currently aren't used as far as I can tell) to one which is
guaranteed not to be.

-hpa


-hpa

2009-07-07 16:06:25

by John Gregor

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

"H. Peter Anvin" <[email protected]> wrote:
> WC does not guarantee this condition to the best of my
> knowledge, although I haven't found anything that explicitly states that
> WC memory has to be idempotent (it is documented, though, that it can
> collapse stores to the same address, so it might have been considered a
> resulting issue.)

Yes, dealing with WC IO memory is (and has been) a pain. But it is
also the only way to get the kind of performance IB requires. We aren't
the only hardware that does this. Protestations that our ASIC is just
wrong really aren't helpful at this stage. We're well aware of how much
we are at the mercy of the processors and chipsets and there is little
architectural guarantee for much of anything. Welcome to HPC.

> Anyway, I'm fine simply removing the x86 assembly version of this
> routine and simply using the generic routine.

And, I'm fine with taking the asm back into our driver as long as people
then don't just turn around and give us grief for having an #ifdef for
X86_64 and inline asm.

> The checkin should
> reference the observed behavior (which I would also suggest escalating
> via your Intel FAEs if you can; I can try to investigate internally as
> well.)

I haven't felt at liberty to disclose what we have or haven't discussed
with Intel; I'm just sharing our direct findings - rep movs to WC MMIO
leads to multiple writes to the same address in some instances on some
processors (at least Nehalem). So this is an issue for us (and anyone
doing similar io). And since the C compiler is free to emit a rep movs
sequence, this needs to be done as asm.

Since the __iowrite_* routines are meant to do, well, I/O, they should be
as robust as possible against things that might break I/O for particular
hardware. I suggest that the issue with rep movs is just such a case.

> If you want 64-bit transfers, you should use __iowrite64_copy();

What we need would probably be most accurately called
__iowrite_atleast32_copy().

> if you want completely private behavior you should use your own routine
> (and just write it in C) rather than changing the documented behavior of
> an existing routine from one that might be useful for other drivers
> (although currently aren't used as far as I can tell) to one which is
> guaranteed not to be.

Again, just as a bit of history, the routine exists explicitly because
of our driver. See commits c27a0d75b33 and 0f07496144c. We just goofed
up a bit by not being more clear about the requirements of the routines
we submitted. In fact, in 2006 it probably wasn't even clear to us
what the precise requirements were going to turn out to be.

Anyway, we're very likely going to have to stick with asm for X86_64 no
matter how it gets implemented both for performance and as protection
against the compiler generating a rep movs sequence.

So, I see at least a couple possible courses of action:

1. Redefine __iowrite32_copy(), document it better, and take the
new asm.
2. Delete the .S file, warn about the possibility of the compiler
generating a rep movs sequence in the C code, and the ipath driver
will have a private copy routine again.

Thoughts?

-John Gregor

2009-07-08 00:46:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

John A. Gregor wrote:
>
> Anyway, we're very likely going to have to stick with asm for X86_64 no
> matter how it gets implemented both for performance and as protection
> against the compiler generating a rep movs sequence.
>
> So, I see at least a couple possible courses of action:
>
> 1. Redefine __iowrite32_copy(), document it better, and take the
> new asm.
> 2. Delete the .S file, warn about the possibility of the compiler
> generating a rep movs sequence in the C code, and the ipath driver
> will have a private copy routine again.
>

I discussed this with H.J. Lu (x86 gcc maintainer), and he stated rather
unequivocally that gcc will not generate string instructions for a
volatile store, and that the gcc team considers that part of the
definition for volatile. So the C version is guaranteed to be good; we
should just remove the assembly version.

Redefining a public API -- no matter if you were the original authors --
is not an option. If you need something other than __iowrite32_copy()
or __iowrite64_copy() you should define a new API or make it a private
one. Either which way, a C implementation is guaranteed to be safe.

-hpa

2009-07-10 12:06:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes


> - the current code is fine as long as it doesn't cross from cached to
> uncached memory, which it never should in any kind of sensible driver;

Should we add a comment somewhere saying that is not permitted (and why)?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-07-10 15:07:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

Pavel Machek wrote:
>> - the current code is fine as long as it doesn't cross from cached to
>> uncached memory, which it never should in any kind of sensible driver;
>
> Should we add a comment somewhere saying that is not permitted (and why)?

It sounds like what we should do is simply remove
arch/x86/lib/iomap_copy_64.S, and probably add a comment to
lib/iomap_copy.c stating that rep string instructions have been reported
to cause anomalies when accessing WC memory, and that we rely on gcc not
generating them due thanks to the volatiles in the implementation of
__raw_writel(). H.J. Lu tells me that gcc will not generate string
instructions for volatiles, and that the x86 gcc team considers that
part of the definition for volatile. (HJL: please correct if I got that
incorrect.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-07-10 16:03:10

by Lu, Hongjiu

[permalink] [raw]
Subject: RE: [PATCH] x86_64/__iowrite32_copy: don't use string move for PIO writes

> Pavel Machek wrote:
> >> - the current code is fine as long as it doesn't cross from cached to
> >> uncached memory, which it never should in any kind of sensible driver;
> >
> > Should we add a comment somewhere saying that is not permitted (and why)?
>
> It sounds like what we should do is simply remove
> arch/x86/lib/iomap_copy_64.S, and probably add a comment to
> lib/iomap_copy.c stating that rep string instructions have been reported
> to cause anomalies when accessing WC memory, and that we rely on gcc not
> generating them due thanks to the volatiles in the implementation of
> __raw_writel(). H.J. Lu tells me that gcc will not generate string
> instructions for volatiles, and that the x86 gcc team considers that
> part of the definition for volatile. (HJL: please correct if I got that
> incorrect.)
>

That is correct. If it isn't the case, please open a gcc bug report.


H.J.
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?