There are some generic drivers in the kernel, which make use of the
q-accessors or their derivatives. While at current asm/io.h the accessors
are defined, their implementation is only applicable either for 64bit
systems, or for systems with cpu_has_64bits flag set. Obviously there
are MIPS systems which are neither of these, but still need to have
those drivers supported. In this case the solution is to define some
generic versions of the q-accessors, but with a limitation to be
non-atomic. Such accessors are defined in the
io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
q-suffixed IO-methods are supposed to include the header file, so
in case if these accessors aren't defined for the platform, the generic
non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
file provides the q-accessors for any MIPS system even for ones, which
in fact don't support them and raise BUG() in case if any of them is
called. Due to this the generic versions of the accessors are never
used while an attempt to call the IO-methods causes the kernel BUG().
In order to fix this we need to define the q-accessors only for
the MIPS systems, which actually support them, and don't define them
otherwise, so to let the corresponding drivers to use the non-atomic
q-suffixed accessors.
Signed-off-by: Serge Semin <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Cc: Vadim V. Vlasov <[email protected]>
---
arch/mips/include/asm/io.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 29997e42480e..4597017f147b 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -462,7 +462,12 @@ __BUILD_MEMORY_PFX(, bwlq, type, 0)
BUILDIO_MEM(b, u8)
BUILDIO_MEM(w, u16)
BUILDIO_MEM(l, u32)
+#ifdef CONFIG_64BIT
BUILDIO_MEM(q, u64)
+#else
+__BUILD_MEMORY_PFX(__raw_, q, u64, 0)
+__BUILD_MEMORY_PFX(__mem_, q, u64, 0)
+#endif
#define __BUILD_IOPORT_PFX(bus, bwlq, type) \
__BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
@@ -488,12 +493,16 @@ __BUILDIO(q, u64)
#define readb_relaxed __relaxed_readb
#define readw_relaxed __relaxed_readw
#define readl_relaxed __relaxed_readl
+#ifdef CONFIG_64BIT
#define readq_relaxed __relaxed_readq
+#endif
#define writeb_relaxed __relaxed_writeb
#define writew_relaxed __relaxed_writew
#define writel_relaxed __relaxed_writel
+#ifdef CONFIG_64BIT
#define writeq_relaxed __relaxed_writeq
+#endif
#define readb_be(addr) \
__raw_readb((__force unsigned *)(addr))
@@ -516,8 +525,10 @@ __BUILDIO(q, u64)
/*
* Some code tests for these symbols
*/
+#ifdef CONFIG_64BIT
#define readq readq
#define writeq writeq
+#endif
#define __BUILD_MEMORY_STRING(bwlq, type) \
\
--
2.21.0
Hi Serge,
On Fri, Jun 14, 2019 at 09:33:42AM +0300, Serge Semin wrote:
> There are some generic drivers in the kernel, which make use of the
> q-accessors or their derivatives. While at current asm/io.h the accessors
> are defined, their implementation is only applicable either for 64bit
> systems, or for systems with cpu_has_64bits flag set. Obviously there
> are MIPS systems which are neither of these, but still need to have
> those drivers supported. In this case the solution is to define some
> generic versions of the q-accessors, but with a limitation to be
> non-atomic. Such accessors are defined in the
> io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> q-suffixed IO-methods are supposed to include the header file, so
> in case if these accessors aren't defined for the platform, the generic
> non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> file provides the q-accessors for any MIPS system even for ones, which
> in fact don't support them and raise BUG() in case if any of them is
> called. Due to this the generic versions of the accessors are never
> used while an attempt to call the IO-methods causes the kernel BUG().
> In order to fix this we need to define the q-accessors only for
> the MIPS systems, which actually support them, and don't define them
> otherwise, so to let the corresponding drivers to use the non-atomic
> q-suffixed accessors.
>
> Signed-off-by: Serge Semin <[email protected]>
> Suggested-by: Arnd Bergmann <[email protected]>
> Cc: Vadim V. Vlasov <[email protected]>
> ---
> arch/mips/include/asm/io.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
So this seems pretty reasonable. Build testing all our defconfigs only
showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
drivers/net/fddi/defza.c: In function 'fza_reads':
drivers/net/fddi/defza.c:88:17: error: implicit declaration of
function 'readq_relaxed'; did you mean 'readw_relaxed'?
[-Werror=implicit-function-declaration]
#define readq_u readq_relaxed
^~~~~~~~~~~~~
drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
*dst++ = readq_u(src++);
^~~~~~~
drivers/net/fddi/defza.c: In function 'fza_writes':
drivers/net/fddi/defza.c:92:18: error: implicit declaration of
function 'writeq_relaxed'; did you mean 'writel_relaxed'?
[-Werror=implicit-function-declaration]
#define writeq_u writeq_relaxed
^~~~~~~~~~~~~~
drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
writeq_u(*src++, dst++);
^~~~~~~~
CC net/core/scm.o
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
These uses of readq_relaxed & writeq_relaxed are both conditional upon
sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
to present a runtime issue but we need to provide some implementation of
the *q accessors to keep the compiler happy.
I see a few options:
1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
the appropriate declarations, which should then get optimized away by
the compiler anyway & never actually be used.
2) We could have defza.h #define its readq_u & writeq_u macros
differently for CONFIG_32BIT=y kernels, perhaps using
__compiletime_error to catch any bogus use of them.
3) We could do the same in a generic header, though if nobody else has
needed it so far & this is the only place we need it then maybe it's
not worth it.
So I'm thinking option 2 might be best, as below. Having said that I
don't mind option 1 either - it's simple. Maciej do you have any
preference?
Thanks,
Paul
---
diff --git a/drivers/net/fddi/defza.c b/drivers/net/fddi/defza.c
index c5cae8e74dc4..85d6a7f22fe7 100644
--- a/drivers/net/fddi/defza.c
+++ b/drivers/net/fddi/defza.c
@@ -85,11 +85,21 @@ static u8 hw_addr_beacon[8] = { 0x01, 0x80, 0xc2, 0x00, 0x01, 0x00 };
*/
#define readw_u readw_relaxed
#define readl_u readl_relaxed
-#define readq_u readq_relaxed
#define writew_u writew_relaxed
#define writel_u writel_relaxed
-#define writeq_u writeq_relaxed
+
+#ifdef CONFIG_32BIT
+extern u64 defza_readq_u(const void *ptr)
+ __compiletime_error("readq_u should not be used by 32b kernels");
+extern void defza_writeq_u(u64 val, void *ptr)
+ __compiletime_error("writeq_u should not be used by 32b kernels");
+# define readq_u defza_readq_u
+# define writeq_u defza_writeq_u
+#else
+# define readq_u readq_relaxed
+# define writeq_u writeq_relaxed
+#endif
static inline struct sk_buff *fza_alloc_skb_irq(struct net_device *dev,
unsigned int length)
On Thu, 20 Jun 2019, Paul Burton wrote:
> So this seems pretty reasonable. Build testing all our defconfigs only
> showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
>
> drivers/net/fddi/defza.c: In function 'fza_reads':
> drivers/net/fddi/defza.c:88:17: error: implicit declaration of
> function 'readq_relaxed'; did you mean 'readw_relaxed'?
> [-Werror=implicit-function-declaration]
> #define readq_u readq_relaxed
> ^~~~~~~~~~~~~
> drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
> *dst++ = readq_u(src++);
> ^~~~~~~
> drivers/net/fddi/defza.c: In function 'fza_writes':
> drivers/net/fddi/defza.c:92:18: error: implicit declaration of
> function 'writeq_relaxed'; did you mean 'writel_relaxed'?
> [-Werror=implicit-function-declaration]
> #define writeq_u writeq_relaxed
> ^~~~~~~~~~~~~~
> drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
> writeq_u(*src++, dst++);
> ^~~~~~~~
> CC net/core/scm.o
> cc1: some warnings being treated as errors
> make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
>
> These uses of readq_relaxed & writeq_relaxed are both conditional upon
> sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> to present a runtime issue but we need to provide some implementation of
> the *q accessors to keep the compiler happy.
>
> I see a few options:
>
> 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
> the appropriate declarations, which should then get optimized away by
> the compiler anyway & never actually be used.
This, definitely.
> 2) We could have defza.h #define its readq_u & writeq_u macros
> differently for CONFIG_32BIT=y kernels, perhaps using
> __compiletime_error to catch any bogus use of them.
>
> 3) We could do the same in a generic header, though if nobody else has
> needed it so far & this is the only place we need it then maybe it's
> not worth it.
>
> So I'm thinking option 2 might be best, as below. Having said that I
> don't mind option 1 either - it's simple. Maciej do you have any
> preference?
The use of 64-bit operations to access option's packet memory, which is
true SRAM, i.e. no side effects, is to improve throughput only and there's
no need for atomicity here nor also any kind of barriers, except at the
conclusion. Splitting 64-bit accesses into 32-bit halves in software
would not be a functional error here.
I benchmarked it back in the day and the difference was noticeable with
actual network transmissions between loops using 32-bit (LW/SW) and 64-bit
(LD/SD) accesses respectively, though I may not be able to track down the
figures anymore as it must have been some 18 years ago. The performance
of the MB ASIC used to interface the R4400 CPU to DRAM and TURBOchannel
with the 5000/260 systems was stellar as it was specifically designed with
high throughput in mind, as an upgrade to the exiting R3400-based 5000/240
systems (the CPU and the ASIC are both on a daughtercard), though new such
machines used to be sold as well.
For the record the CPU and TURBOchannel run at 60MHz (40MHz with the
R3400) and 25MHz respectively with these systems.
Thanks for the heads-up!
Maciej
Hello Paul,
On Thu, Jun 20, 2019 at 05:40:04PM +0000, Paul Burton wrote:
> Hi Serge,
>
> On Fri, Jun 14, 2019 at 09:33:42AM +0300, Serge Semin wrote:
> > There are some generic drivers in the kernel, which make use of the
> > q-accessors or their derivatives. While at current asm/io.h the accessors
> > are defined, their implementation is only applicable either for 64bit
> > systems, or for systems with cpu_has_64bits flag set. Obviously there
> > are MIPS systems which are neither of these, but still need to have
> > those drivers supported. In this case the solution is to define some
> > generic versions of the q-accessors, but with a limitation to be
> > non-atomic. Such accessors are defined in the
> > io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> > q-suffixed IO-methods are supposed to include the header file, so
> > in case if these accessors aren't defined for the platform, the generic
> > non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> > file provides the q-accessors for any MIPS system even for ones, which
> > in fact don't support them and raise BUG() in case if any of them is
> > called. Due to this the generic versions of the accessors are never
> > used while an attempt to call the IO-methods causes the kernel BUG().
> > In order to fix this we need to define the q-accessors only for
> > the MIPS systems, which actually support them, and don't define them
> > otherwise, so to let the corresponding drivers to use the non-atomic
> > q-suffixed accessors.
> >
> > Signed-off-by: Serge Semin <[email protected]>
> > Suggested-by: Arnd Bergmann <[email protected]>
> > Cc: Vadim V. Vlasov <[email protected]>
> > ---
> > arch/mips/include/asm/io.h | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
>
> So this seems pretty reasonable. Build testing all our defconfigs only
> showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
>
> drivers/net/fddi/defza.c: In function 'fza_reads':
> drivers/net/fddi/defza.c:88:17: error: implicit declaration of
> function 'readq_relaxed'; did you mean 'readw_relaxed'?
> [-Werror=implicit-function-declaration]
> #define readq_u readq_relaxed
> ^~~~~~~~~~~~~
> drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
> *dst++ = readq_u(src++);
> ^~~~~~~
> drivers/net/fddi/defza.c: In function 'fza_writes':
> drivers/net/fddi/defza.c:92:18: error: implicit declaration of
> function 'writeq_relaxed'; did you mean 'writel_relaxed'?
> [-Werror=implicit-function-declaration]
> #define writeq_u writeq_relaxed
> ^~~~~~~~~~~~~~
> drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
> writeq_u(*src++, dst++);
> ^~~~~~~~
> CC net/core/scm.o
> cc1: some warnings being treated as errors
> make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
>
Thanks for review and testing this for each target. I see you and Maciej already
agreed regarding the solution, and you even sent the fixup. So I don't have
to send the v2 patch.)
Regards,
-Sergey
> These uses of readq_relaxed & writeq_relaxed are both conditional upon
> sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> to present a runtime issue but we need to provide some implementation of
> the *q accessors to keep the compiler happy.
>
> I see a few options:
>
> 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
> the appropriate declarations, which should then get optimized away by
> the compiler anyway & never actually be used.
>
> 2) We could have defza.h #define its readq_u & writeq_u macros
> differently for CONFIG_32BIT=y kernels, perhaps using
> __compiletime_error to catch any bogus use of them.
>
> 3) We could do the same in a generic header, though if nobody else has
> needed it so far & this is the only place we need it then maybe it's
> not worth it.
>
> So I'm thinking option 2 might be best, as below. Having said that I
> don't mind option 1 either - it's simple. Maciej do you have any
> preference?
>
> Thanks,
> Paul
>
> ---
> diff --git a/drivers/net/fddi/defza.c b/drivers/net/fddi/defza.c
> index c5cae8e74dc4..85d6a7f22fe7 100644
> --- a/drivers/net/fddi/defza.c
> +++ b/drivers/net/fddi/defza.c
> @@ -85,11 +85,21 @@ static u8 hw_addr_beacon[8] = { 0x01, 0x80, 0xc2, 0x00, 0x01, 0x00 };
> */
> #define readw_u readw_relaxed
> #define readl_u readl_relaxed
> -#define readq_u readq_relaxed
>
> #define writew_u writew_relaxed
> #define writel_u writel_relaxed
> -#define writeq_u writeq_relaxed
> +
> +#ifdef CONFIG_32BIT
> +extern u64 defza_readq_u(const void *ptr)
> + __compiletime_error("readq_u should not be used by 32b kernels");
> +extern void defza_writeq_u(u64 val, void *ptr)
> + __compiletime_error("writeq_u should not be used by 32b kernels");
> +# define readq_u defza_readq_u
> +# define writeq_u defza_writeq_u
> +#else
> +# define readq_u readq_relaxed
> +# define writeq_u writeq_relaxed
> +#endif
>
> static inline struct sk_buff *fza_alloc_skb_irq(struct net_device *dev,
> unsigned int length)
>
On Thu, Jun 20, 2019 at 8:19 PM Maciej W. Rozycki <[email protected]> wrote:
>
> On Thu, 20 Jun 2019, Paul Burton wrote:
>
> > So this seems pretty reasonable. Build testing all our defconfigs only
> > showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
> >
> > drivers/net/fddi/defza.c: In function 'fza_reads':
> > drivers/net/fddi/defza.c:88:17: error: implicit declaration of
> > function 'readq_relaxed'; did you mean 'readw_relaxed'?
> > [-Werror=implicit-function-declaration]
> > #define readq_u readq_relaxed
> > ^~~~~~~~~~~~~
> > drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
> > *dst++ = readq_u(src++);
> > ^~~~~~~
> > drivers/net/fddi/defza.c: In function 'fza_writes':
> > drivers/net/fddi/defza.c:92:18: error: implicit declaration of
> > function 'writeq_relaxed'; did you mean 'writel_relaxed'?
> > [-Werror=implicit-function-declaration]
> > #define writeq_u writeq_relaxed
> > ^~~~~~~~~~~~~~
> > drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
> > writeq_u(*src++, dst++);
> > ^~~~~~~~
> > CC net/core/scm.o
> > cc1: some warnings being treated as errors
> > make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
> >
> > These uses of readq_relaxed & writeq_relaxed are both conditional upon
> > sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> > to present a runtime issue but we need to provide some implementation of
> > the *q accessors to keep the compiler happy.
> >
> > I see a few options:
> >
> > 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
> > the appropriate declarations, which should then get optimized away by
> > the compiler anyway & never actually be used.
>
> This, definitely.
The compiler should generally not be allowed to combine two adjacent
readl_relaxed() back into a 64-bit load. Only __raw_readl() can be
combined or split up. If the mips version of the *_relaxed() accessors
allows the compiler to do this, I would consider that a bug.
> > 2) We could have defza.h #define its readq_u & writeq_u macros
> > differently for CONFIG_32BIT=y kernels, perhaps using
> > __compiletime_error to catch any bogus use of them.
> >
> > 3) We could do the same in a generic header, though if nobody else has
> > needed it so far & this is the only place we need it then maybe it's
> > not worth it.
> >
> > So I'm thinking option 2 might be best, as below. Having said that I
> > don't mind option 1 either - it's simple. Maciej do you have any
> > preference?
>
> The use of 64-bit operations to access option's packet memory, which is
> true SRAM, i.e. no side effects, is to improve throughput only and there's
> no need for atomicity here nor also any kind of barriers, except at the
> conclusion. Splitting 64-bit accesses into 32-bit halves in software
> would not be a functional error here.
The other property of packet memory and similar things is that you
basically want memcpy()-behavior with no byteswaps. This is one
of the few cases in which __raw_readq() is actually the right accessor
in (mostly) portable code.
Arnd
On Fri, 21 Jun 2019, Arnd Bergmann wrote:
> > The use of 64-bit operations to access option's packet memory, which is
> > true SRAM, i.e. no side effects, is to improve throughput only and there's
> > no need for atomicity here nor also any kind of barriers, except at the
> > conclusion. Splitting 64-bit accesses into 32-bit halves in software
> > would not be a functional error here.
>
> The other property of packet memory and similar things is that you
> basically want memcpy()-behavior with no byteswaps. This is one
> of the few cases in which __raw_readq() is actually the right accessor
> in (mostly) portable code.
Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
having additional barriers applied on every access would hit performance
very badly; in fact even the barriers `*_relaxed' accessors imply would
best be removed in this use (which is why defza.c uses `readw_o' vs
`readw_u', etc. internally), but after all the struggles over the years
for weakly ordered internal APIs x86 people are so averse to I'm not sure
if I want to start another one. We can get away with `readq_relaxed' in
this use though as all the systems this device can be used with are
little-endian as is TURBOchannel, so no byte-swapping will ever actually
occur.
Maciej
On Fri, Jun 21, 2019 at 12:09 PM Maciej W. Rozycki <[email protected]> wrote:
>
> On Fri, 21 Jun 2019, Arnd Bergmann wrote:
>
> > > The use of 64-bit operations to access option's packet memory, which is
> > > true SRAM, i.e. no side effects, is to improve throughput only and there's
> > > no need for atomicity here nor also any kind of barriers, except at the
> > > conclusion. Splitting 64-bit accesses into 32-bit halves in software
> > > would not be a functional error here.
> >
> > The other property of packet memory and similar things is that you
> > basically want memcpy()-behavior with no byteswaps. This is one
> > of the few cases in which __raw_readq() is actually the right accessor
> > in (mostly) portable code.
>
> Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> having additional barriers applied on every access would hit performance
> very badly;
How so? __raw_readq() by definition has the least barriers of
all, you can't make it more relaxed than it already is.
> in fact even the barriers `*_relaxed' accessors imply would
> best be removed in this use (which is why defza.c uses `readw_o' vs
> `readw_u', etc. internally), but after all the struggles over the years
> for weakly ordered internal APIs x86 people are so averse to I'm not sure
> if I want to start another one. We can get away with `readq_relaxed' in
> this use though as all the systems this device can be used with are
> little-endian as is TURBOchannel, so no byte-swapping will ever actually
> occur.
I still don't see any downside of using __raw_readq() here, while the
upsides are:
- makes the driver portable to big-endian kernels (even though we don't
care)
- avoids all barriers
- fixes the build regression.
Arnd
On Fri, 21 Jun 2019, Arnd Bergmann wrote:
> > > The other property of packet memory and similar things is that you
> > > basically want memcpy()-behavior with no byteswaps. This is one
> > > of the few cases in which __raw_readq() is actually the right accessor
> > > in (mostly) portable code.
> >
> > Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > having additional barriers applied on every access would hit performance
> > very badly;
>
> How so? __raw_readq() by definition has the least barriers of
> all, you can't make it more relaxed than it already is.
Well, `__raw_readq' has all the barriers plain `readq' has except it does
not ever do byte-swapping (which may be bad where address swizzling is
also present). Whereas `readq_relaxed' at least avoids the trailing DMA
barrier.
This is what the MIPS version has:
#define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq) \
[...]
#define __BUILD_MEMORY_PFX(bus, bwlq, type, relax) \
\
__BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)
#define BUILDIO_MEM(bwlq, type) \
\
__BUILD_MEMORY_PFX(__raw_, bwlq, type, 0) \
__BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1) \
__BUILD_MEMORY_PFX(__mem_, bwlq, type, 0) \
__BUILD_MEMORY_PFX(, bwlq, type, 0)
So `barrier' is always passed 1 and consequently all the accessors have a
leading MMIO ordering barrier inserted and only `__relaxed_*' ones have
`relax' set to 0 making them skip the trailing MMIO read vs DMA ordering
barrier. This is in accordance to Documentation/memory-barriers.txt I
believe.
NB I got one part wrong in the previous e-mail, sorry, as for packet
memory accesses etc. the correct accessors are actually `__mem_*' rather
than `__raw_*' ones, but the former ones are not portable. I always
forget about this peculiarity and it took us years to get it right with
the MIPS port and the old IDE subsystem when doing PIO.
The `__mem_*' handlers still do whetever system-specific transformation
is required to present data in the memory rather than CPU byte ordering.
See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial
example and arch/mips/include/asm/mach-generic/mangle-port.h for the
general case. Whereas `__raw_*' pass raw data unchanged and are generally
only suitable for accesses to onchip SOC MMIO or similar resources that do
not traverse any external bus where a system's endianness may be observed.
So contrary to what I have written before for the theoretical case of a
big-endian system possibly doing address swizzling we'd have to define and
use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed',
etc.
> > in fact even the barriers `*_relaxed' accessors imply would
> > best be removed in this use (which is why defza.c uses `readw_o' vs
> > `readw_u', etc. internally), but after all the struggles over the years
> > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > if I want to start another one. We can get away with `readq_relaxed' in
> > this use though as all the systems this device can be used with are
> > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > occur.
>
> I still don't see any downside of using __raw_readq() here, while the
> upsides are:
>
> - makes the driver portable to big-endian kernels (even though we don't
> care)
> - avoids all barriers
> - fixes the build regression.
Giving my observations above it would only address item #3 on your list,
while addressing #1 and #2 would require defining `__mem_readq_unordered',
etc. I am afraid.
Have I missed anything?
Maciej
On Fri, Jun 21, 2019 at 2:24 PM Maciej W. Rozycki <[email protected]> wrote:
> On Fri, 21 Jun 2019, Arnd Bergmann wrote:
> > > > The other property of packet memory and similar things is that you
> > > > basically want memcpy()-behavior with no byteswaps. This is one
> > > > of the few cases in which __raw_readq() is actually the right accessor
> > > > in (mostly) portable code.
> > >
> > > Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > > having additional barriers applied on every access would hit performance
> > > very badly;
> >
> > How so? __raw_readq() by definition has the least barriers of
> > all, you can't make it more relaxed than it already is.
>
> Well, `__raw_readq' has all the barriers plain `readq' has except it does
> not ever do byte-swapping (which may be bad where address swizzling is
> also present). Whereas `readq_relaxed' at least avoids the trailing DMA
> barrier.
>
> This is what the MIPS version has:
>
> #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq) \
> [...]
>
> #define __BUILD_MEMORY_PFX(bus, bwlq, type, relax) \
> \
> __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)
>
> #define BUILDIO_MEM(bwlq, type) \
> \
> __BUILD_MEMORY_PFX(__raw_, bwlq, type, 0) \
> __BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1) \
> __BUILD_MEMORY_PFX(__mem_, bwlq, type, 0) \
> __BUILD_MEMORY_PFX(, bwlq, type, 0)
>
> So `barrier' is always passed 1 and consequently all the accessors have a
> leading MMIO ordering barrier inserted and only `__relaxed_*' ones have
> `relax' set to 0 making them skip the trailing MMIO read vs DMA ordering
> barrier. This is in accordance to Documentation/memory-barriers.txt I
> believe.
It is definitely not what other architectures do here. In particular, the
asm-generic implementation that is now used on most of them
defines raw_readl() as
static inline u32 __raw_readl(const volatile void __iomem *addr)
{
return *(const volatile u32 __force *)addr;
}
and there are a number of drivers that depend on this behavior.
readl_relaxed() typically adds the byteswap on this, and readl() adds
the barriers on top of readl_relaxed().
> NB I got one part wrong in the previous e-mail, sorry, as for packet
> memory accesses etc. the correct accessors are actually `__mem_*' rather
> than `__raw_*' ones, but the former ones are not portable. I always
> forget about this peculiarity and it took us years to get it right with
> the MIPS port and the old IDE subsystem when doing PIO.
>
> The `__mem_*' handlers still do whetever system-specific transformation
> is required to present data in the memory rather than CPU byte ordering.
> See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial
> example and arch/mips/include/asm/mach-generic/mangle-port.h for the
> general case. Whereas `__raw_*' pass raw data unchanged and are generally
> only suitable for accesses to onchip SOC MMIO or similar resources that do
> not traverse any external bus where a system's endianness may be observed.
Ok, so what you have for __mem_* is actually what I had expected from
__raw_* for an architecture, except for the barriers that should have been
left out.
> So contrary to what I have written before for the theoretical case of a
> big-endian system possibly doing address swizzling we'd have to define and
> use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed',
> etc.
Right.
> > > in fact even the barriers `*_relaxed' accessors imply would
> > > best be removed in this use (which is why defza.c uses `readw_o' vs
> > > `readw_u', etc. internally), but after all the struggles over the years
> > > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > > if I want to start another one. We can get away with `readq_relaxed' in
> > > this use though as all the systems this device can be used with are
> > > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > > occur.
> >
> > I still don't see any downside of using __raw_readq() here, while the
> > upsides are:
> >
> > - makes the driver portable to big-endian kernels (even though we don't
> > care)
> > - avoids all barriers
> > - fixes the build regression.
>
> Giving my observations above it would only address item #3 on your list,
> while addressing #1 and #2 would require defining `__mem_readq_unordered',
> etc. I am afraid.
>
> Have I missed anything?
No, I think you are right based on how mips defines __raw_readl().
Unfortunately, this also means that all portable drivers using the
__raw_ accessors to do what you want here are broken on mips
(at least on big-endian), while mips drivers using __raw_* are not
portable to anything else.
Arnd
Arnd,
We're getting into MMIO and barriers again, sigh. Cc-ing people recently
involved then.
On Fri, 21 Jun 2019, Arnd Bergmann wrote:
> > > > > The other property of packet memory and similar things is that you
> > > > > basically want memcpy()-behavior with no byteswaps. This is one
> > > > > of the few cases in which __raw_readq() is actually the right accessor
> > > > > in (mostly) portable code.
> > > >
> > > > Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > > > having additional barriers applied on every access would hit performance
> > > > very badly;
> > >
> > > How so? __raw_readq() by definition has the least barriers of
> > > all, you can't make it more relaxed than it already is.
> >
> > Well, `__raw_readq' has all the barriers plain `readq' has except it does
> > not ever do byte-swapping (which may be bad where address swizzling is
> > also present). Whereas `readq_relaxed' at least avoids the trailing DMA
> > barrier.
> >
> > This is what the MIPS version has:
> >
> > #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq) \
> > [...]
> >
> > #define __BUILD_MEMORY_PFX(bus, bwlq, type, relax) \
> > \
> > __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)
> >
> > #define BUILDIO_MEM(bwlq, type) \
> > \
> > __BUILD_MEMORY_PFX(__raw_, bwlq, type, 0) \
> > __BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1) \
> > __BUILD_MEMORY_PFX(__mem_, bwlq, type, 0) \
> > __BUILD_MEMORY_PFX(, bwlq, type, 0)
> >
> > So `barrier' is always passed 1 and consequently all the accessors have a
> > leading MMIO ordering barrier inserted and only `__relaxed_*' ones have
> > `relax' set to 0 making them skip the trailing MMIO read vs DMA ordering
> > barrier. This is in accordance to Documentation/memory-barriers.txt I
> > believe.
>
> It is definitely not what other architectures do here. In particular, the
> asm-generic implementation that is now used on most of them
> defines raw_readl() as
>
> static inline u32 __raw_readl(const volatile void __iomem *addr)
> {
> return *(const volatile u32 __force *)addr;
> }
>
> and there are a number of drivers that depend on this behavior.
> readl_relaxed() typically adds the byteswap on this, and readl() adds
> the barriers on top of readl_relaxed().
If the lack of any ordering barriers with `__raw_*' accessors is indeed
intended, then we need to document it in `memory-barriers.txt' as the
expected semantics.
This however does not make the interface suitable for accesses to
memory-like MMIO resources, e.g. string I/O, but also the ATA data port,
unless you want to make/keep the nomenclature more confusing than it has
to be.
First some backgound information.
1. For systems that have a uniform endianness throughout there's obviously
no issue. We only need to choose between strongly-ordered and, if
applicable weakly-ordered accesses. As the minimum and at the cost of
performance we can enforce strong MMIO ordering everywhere.
2. For systems parts of which have the opposite endianness we can have two
variants (or endianness policies) implemented in hardware for accesses
that cross the boundary of the two endiannesses:
a. Byte lane matching.
In this case individual byte addresses are preserved while accessing
a device and byte enables are generated directly from the least
significant bits of the CPU address, meaning that the device is
addressed within its decoding window directly as documented.
The data quantity accessed, however, has to be byte-swapped to be
interpreted within the domain of the opposite endianness as a
numerical value or a bit pattern if wider than 8 bits.
b. Bit lane matching.
In this case individual byte addresses are mangled while accessing a
device and byte enables are generated from the inverse of the
leastsignificant bits of the CPU address, meaning that to calculate
a device address the CPU address has to be transformed, unless
accessing an aligned data quantity of the bus width.
The data quantity accessed, however, can be used directly to be
interpreted within the domain of the opposite endianness as a
numerical value or a bit pattern regardless of its width.
Like with the systems of a uniform endianness ordering is to be
considered separately.
Existing computer systems based on the MIPS processor implement either
endianness policy or both. The existence of systems that do bit lane
matching is why we have all the arch/mips/include/asm/mach-*/mangle-port.h
headers.
I actually own a SOC-based system that implements both, when strapped at
reset for the big endianness, by having two MMIO windows defined in its
address space, which assume each of the policies respectively, and the
boundary between the two endianness domains is at the PCI host bridge.
This means that device accesses that cross the PCI bridge need address
mangling or byte-swapping as necessary while device accesses that access
onchip SOC devices or external devices wired to SOC's interfaces other
than PCI require straight through accesses.
The implementers of our Linux port for this system chose to use the byte
lane matching policy exclusively, although for performance reason it might
make sense to switch between the two policies depending on whether data is
to be interpreted as a value or memory contents as the offset between the
two windows is fixed.
Anyway, for a system like this and regardless of any ordering requirement
we clearly need to have three kinds of accessors:
1. Native (pass-through data, never with address mangling).
2. Value (data byte-swapped if required, possibly with address mangling).
3. Memory (data byte-swapped if required, possibly with address mangling).
And I think the natural nomenclature for the three kinds of accessors
respectively is:
1. `__raw_*'.
2. `*' (no prefix by long-established practice, although `__val_*' would
do too, IMO).
3. `__mem_*'.
So I think the MIPS port got it right (and the rest is confused, IMHO).
Furthermore, going back to the ordering requirement, I think we also need
to have strongly-ordered native accessors, so that driver writers do not
have to be bothered about sprinkling barriers in the common case.
The MIPS port currently meets all the requirements, although it does not
provide weakly ordered accessors.
> > NB I got one part wrong in the previous e-mail, sorry, as for packet
> > memory accesses etc. the correct accessors are actually `__mem_*' rather
> > than `__raw_*' ones, but the former ones are not portable. I always
> > forget about this peculiarity and it took us years to get it right with
> > the MIPS port and the old IDE subsystem when doing PIO.
> >
> > The `__mem_*' handlers still do whetever system-specific transformation
> > is required to present data in the memory rather than CPU byte ordering.
> > See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial
> > example and arch/mips/include/asm/mach-generic/mangle-port.h for the
> > general case. Whereas `__raw_*' pass raw data unchanged and are generally
> > only suitable for accesses to onchip SOC MMIO or similar resources that do
> > not traverse any external bus where a system's endianness may be observed.
>
> Ok, so what you have for __mem_* is actually what I had expected from
> __raw_* for an architecture, except for the barriers that should have been
> left out.
Please see above for an explanation as to why `__mem_*' is equivalent to
neither `__raw_*' nor plain `*'.
> > So contrary to what I have written before for the theoretical case of a
> > big-endian system possibly doing address swizzling we'd have to define and
> > use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed',
> > etc.
>
> Right.
>
> > > > in fact even the barriers `*_relaxed' accessors imply would
> > > > best be removed in this use (which is why defza.c uses `readw_o' vs
> > > > `readw_u', etc. internally), but after all the struggles over the years
> > > > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > > > if I want to start another one. We can get away with `readq_relaxed' in
> > > > this use though as all the systems this device can be used with are
> > > > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > > > occur.
> > >
> > > I still don't see any downside of using __raw_readq() here, while the
> > > upsides are:
> > >
> > > - makes the driver portable to big-endian kernels (even though we don't
> > > care)
> > > - avoids all barriers
> > > - fixes the build regression.
> >
> > Giving my observations above it would only address item #3 on your list,
> > while addressing #1 and #2 would require defining `__mem_readq_unordered',
> > etc. I am afraid.
> >
> > Have I missed anything?
>
> No, I think you are right based on how mips defines __raw_readl().
>
> Unfortunately, this also means that all portable drivers using the
> __raw_ accessors to do what you want here are broken on mips
> (at least on big-endian), while mips drivers using __raw_* are not
> portable to anything else.
I guess your conclusion is correct then, but I maintain that the
nomenclature chosen for the MIPS port years ago is the correct one. This
goes back to commit 4912ba72d6e2 ("Define mem_*() I/O accessory functions
that preserve byte addresses.") which defined `mem_*' accessors later
renamed to `__mem_*' with commit 290f10ae4230 ("mips: namespace pollution
- mem_... -> __mem_... in io.h"). I don't know at what stage of awareness
of the three scenarios other ports were back then.
Questions, comments, thoughts?
Maciej
Hello,
Serge Semin wrote:
> There are some generic drivers in the kernel, which make use of the
> q-accessors or their derivatives. While at current asm/io.h the accessors
> are defined, their implementation is only applicable either for 64bit
> systems, or for systems with cpu_has_64bits flag set. Obviously there
> are MIPS systems which are neither of these, but still need to have
> those drivers supported. In this case the solution is to define some
> generic versions of the q-accessors, but with a limitation to be
> non-atomic. Such accessors are defined in the
> io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> q-suffixed IO-methods are supposed to include the header file, so
> in case if these accessors aren't defined for the platform, the generic
> non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> file provides the q-accessors for any MIPS system even for ones, which
> in fact don't support them and raise BUG() in case if any of them is
> called. Due to this the generic versions of the accessors are never
> used while an attempt to call the IO-methods causes the kernel BUG().
> In order to fix this we need to define the q-accessors only for
> the MIPS systems, which actually support them, and don't define them
> otherwise, so to let the corresponding drivers to use the non-atomic
> q-suffixed accessors.
>
> Signed-off-by: Serge Semin <[email protected]>
> Suggested-by: Arnd Bergmann <[email protected]>
> Cc: Vadim V. Vlasov <[email protected]>
Applied to mips-next.
Thanks,
Paul
[ This message was auto-generated; if you believe anything is incorrect
then please email [email protected] to report it. ]