LinuxLists.cc - [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-18 11:36:49

by Alexey Kardashevskiy

[permalink] [raw]

Subject: [PATCH] vfio: Fix endianness handling for emulated BARs

VFIO exposes BARs to user space as a byte stream so userspace can
read it using pread()/pwrite(). Since this is a byte stream, VFIO should
not do byte swapping and simply return values as it gets them from
PCI device.

Instead, the existing code assumes that byte stream in read/write is
little-endian and it fixes endianness for values which it passes to
ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
little endian and le32_to_cpu/... are stubs.

This also works for big endian but rather by an accident: it reads 4 bytes
from the stream (@val is big endian), converts to CPU format (which should
be big endian) as it was little endian (@val becomes actually little
endian) and calls iowrite32() which does not do swapping on big endian
system.

This removes byte swapping and makes use ioread32be/iowrite32be
(and 16bit versions) on big-endian systems. The "be" helpers take
native endian values and do swapping at the moment of writing to a PCI
register using one of "store byte-reversed" instructions.

Suggested-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 210db24..f363b5a 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -21,6 +21,18 @@

#include "vfio_pci_private.h"

+#ifdef __BIG_ENDIAN__
+#define ioread16_native ioread16be
+#define ioread32_native ioread32be
+#define iowrite16_native iowrite16be
+#define iowrite32_native iowrite32be
+#else
+#define ioread16_native ioread16
+#define ioread32_native ioread32
+#define iowrite16_native iowrite16
+#define iowrite32_native iowrite32
+#endif
+
/*
* Read or write from an __iomem region (MMIO or I/O port) with an excluded
* range which is inaccessible. The excluded range drops writes and fills
@@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
if (copy_from_user(&val, buf, 4))
return -EFAULT;

- iowrite32(le32_to_cpu(val), io + off);
+ iowrite32_native(val, io + off);
} else {
- val = cpu_to_le32(ioread32(io + off));
+ val = ioread32_native(io + off);

if (copy_to_user(buf, &val, 4))
return -EFAULT;
@@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
if (copy_from_user(&val, buf, 2))
return -EFAULT;

- iowrite16(le16_to_cpu(val), io + off);
+ iowrite16_native(val, io + off);
} else {
- val = cpu_to_le16(ioread16(io + off));
+ val = ioread16_native(io + off);

if (copy_to_user(buf, &val, 2))
return -EFAULT;
--
2.0.0

2014-06-18 18:35:28

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
> VFIO exposes BARs to user space as a byte stream so userspace can
> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> not do byte swapping and simply return values as it gets them from
> PCI device.
>
> Instead, the existing code assumes that byte stream in read/write is
> little-endian and it fixes endianness for values which it passes to
> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
> little endian and le32_to_cpu/... are stubs.

vfio read32:

val = cpu_to_le32(ioread32(io + off));

Where the typical x86 case, ioread32 is:

#define ioread32(addr) readl(addr)

and readl is:

__le32_to_cpu(__raw_readl(addr));

So we do canceling byte swaps, which are both nops on x86, and end up
returning device endian, which we assume is little endian.

vfio write32 is similar:

iowrite32(le32_to_cpu(val), io + off);

The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
out, so input data is device endian, which is assumed little.

> This also works for big endian but rather by an accident: it reads 4 bytes
> from the stream (@val is big endian), converts to CPU format (which should
> be big endian) as it was little endian (@val becomes actually little
> endian) and calls iowrite32() which does not do swapping on big endian
> system.

Really?

In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
writel(), which seems to use the generic implementation, which does
include a cpu_to_le32.

I also see other big endian archs like parisc doing cpu_to_le32 on
iowrite32, so I don't think this statement is true. I imagine it's
probably working for you because the swap cancel.

> This removes byte swapping and makes use ioread32be/iowrite32be
> (and 16bit versions) on big-endian systems. The "be" helpers take
> native endian values and do swapping at the moment of writing to a PCI
> register using one of "store byte-reversed" instructions.

So now you want iowrite32() on little endian and iowrite32be() on big
endian, the former does a cpu_to_le32 (which is a nop on little endian)
and the latter does a cpu_to_be32 (which is a nop on big endian)...
should we just be using __raw_writel() on both? There doesn't actually
seem to be any change in behavior here, it just eliminates back-to-back
byte swaps, which are a nop on x86, but not power, right?

> Suggested-by: Benjamin Herrenschmidt <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 210db24..f363b5a 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -21,6 +21,18 @@
>
> #include "vfio_pci_private.h"
>
> +#ifdef __BIG_ENDIAN__
> +#define ioread16_native ioread16be
> +#define ioread32_native ioread32be
> +#define iowrite16_native iowrite16be
> +#define iowrite32_native iowrite32be
> +#else
> +#define ioread16_native ioread16
> +#define ioread32_native ioread32
> +#define iowrite16_native iowrite16
> +#define iowrite32_native iowrite32
> +#endif
> +
> /*
> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
> * range which is inaccessible. The excluded range drops writes and fills
> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> if (copy_from_user(&val, buf, 4))
> return -EFAULT;
>
> - iowrite32(le32_to_cpu(val), io + off);
> + iowrite32_native(val, io + off);
> } else {
> - val = cpu_to_le32(ioread32(io + off));
> + val = ioread32_native(io + off);
>
> if (copy_to_user(buf, &val, 4))
> return -EFAULT;
> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> if (copy_from_user(&val, buf, 2))
> return -EFAULT;
>
> - iowrite16(le16_to_cpu(val), io + off);
> + iowrite16_native(val, io + off);
> } else {
> - val = cpu_to_le16(ioread16(io + off));
> + val = ioread16_native(io + off);
>
> if (copy_to_user(buf, &val, 2))
> return -EFAULT;

2014-06-19 00:51:05

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/19/2014 04:35 AM, Alex Williamson wrote:
> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>> VFIO exposes BARs to user space as a byte stream so userspace can
>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>> not do byte swapping and simply return values as it gets them from
>> PCI device.
>>
>> Instead, the existing code assumes that byte stream in read/write is
>> little-endian and it fixes endianness for values which it passes to
>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>> little endian and le32_to_cpu/... are stubs.
>
> vfio read32:
>
> val = cpu_to_le32(ioread32(io + off));
>
> Where the typical x86 case, ioread32 is:
>
> #define ioread32(addr) readl(addr)
>
> and readl is:
>
> __le32_to_cpu(__raw_readl(addr));
>
> So we do canceling byte swaps, which are both nops on x86, and end up
> returning device endian, which we assume is little endian.
>
> vfio write32 is similar:
>
> iowrite32(le32_to_cpu(val), io + off);
>
> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
> out, so input data is device endian, which is assumed little.
>
>> This also works for big endian but rather by an accident: it reads 4 bytes
>> from the stream (@val is big endian), converts to CPU format (which should
>> be big endian) as it was little endian (@val becomes actually little
>> endian) and calls iowrite32() which does not do swapping on big endian
>> system.
>
> Really?
>
> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
> writel(), which seems to use the generic implementation, which does
> include a cpu_to_le32.

Ouch, wrong comment. iowrite32() does swapping. My bad.

>
> I also see other big endian archs like parisc doing cpu_to_le32 on
> iowrite32, so I don't think this statement is true. I imagine it's
> probably working for you because the swap cancel.
>
>> This removes byte swapping and makes use ioread32be/iowrite32be
>> (and 16bit versions) on big-endian systems. The "be" helpers take
>> native endian values and do swapping at the moment of writing to a PCI
>> register using one of "store byte-reversed" instructions.
>
> So now you want iowrite32() on little endian and iowrite32be() on big
> endian, the former does a cpu_to_le32 (which is a nop on little endian)
> and the latter does a cpu_to_be32 (which is a nop on big endian)...
> should we just be using __raw_writel() on both?

We can do that too. The beauty of iowrite32be on ppc64 is that it does not
swap and write separately, it is implemented via the "Store Word
Byte-Reverse Indexed X-form" single instruction.

And some archs (do not know which ones) may add memory barriers in their
implementations of ioread/iowrite. __raw_writel is too raw :)

> There doesn't actually
> seem to be any change in behavior here, it just eliminates back-to-back
> byte swaps, which are a nop on x86, but not power, right?

Exactly. No dependency for QEMU.

>
>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>> index 210db24..f363b5a 100644
>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>> @@ -21,6 +21,18 @@
>>
>> #include "vfio_pci_private.h"
>>
>> +#ifdef __BIG_ENDIAN__
>> +#define ioread16_native ioread16be
>> +#define ioread32_native ioread32be
>> +#define iowrite16_native iowrite16be
>> +#define iowrite32_native iowrite32be
>> +#else
>> +#define ioread16_native ioread16
>> +#define ioread32_native ioread32
>> +#define iowrite16_native iowrite16
>> +#define iowrite32_native iowrite32
>> +#endif
>> +
>> /*
>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>> * range which is inaccessible. The excluded range drops writes and fills
>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>> if (copy_from_user(&val, buf, 4))
>> return -EFAULT;
>>
>> - iowrite32(le32_to_cpu(val), io + off);
>> + iowrite32_native(val, io + off);
>> } else {
>> - val = cpu_to_le32(ioread32(io + off));
>> + val = ioread32_native(io + off);
>>
>> if (copy_to_user(buf, &val, 4))
>> return -EFAULT;
>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>> if (copy_from_user(&val, buf, 2))
>> return -EFAULT;
>>
>> - iowrite16(le16_to_cpu(val), io + off);
>> + iowrite16_native(val, io + off);
>> } else {
>> - val = cpu_to_le16(ioread16(io + off));
>> + val = ioread16_native(io + off);
>>
>> if (copy_to_user(buf, &val, 2))
>> return -EFAULT;
>
>
>

--
Alexey

2014-06-19 01:50:40

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>> not do byte swapping and simply return values as it gets them from
>>> PCI device.
>>>
>>> Instead, the existing code assumes that byte stream in read/write is
>>> little-endian and it fixes endianness for values which it passes to
>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>>> little endian and le32_to_cpu/... are stubs.
>>
>> vfio read32:
>>
>> val = cpu_to_le32(ioread32(io + off));
>>
>> Where the typical x86 case, ioread32 is:
>>
>> #define ioread32(addr) readl(addr)
>>
>> and readl is:
>>
>> __le32_to_cpu(__raw_readl(addr));
>>
>> So we do canceling byte swaps, which are both nops on x86, and end up
>> returning device endian, which we assume is little endian.
>>
>> vfio write32 is similar:
>>
>> iowrite32(le32_to_cpu(val), io + off);
>>
>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>> out, so input data is device endian, which is assumed little.
>>
>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>> from the stream (@val is big endian), converts to CPU format (which should
>>> be big endian) as it was little endian (@val becomes actually little
>>> endian) and calls iowrite32() which does not do swapping on big endian
>>> system.
>>
>> Really?
>>
>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>> writel(), which seems to use the generic implementation, which does
>> include a cpu_to_le32.
>
>
> Ouch, wrong comment. iowrite32() does swapping. My bad.
>
>
>>
>> I also see other big endian archs like parisc doing cpu_to_le32 on
>> iowrite32, so I don't think this statement is true. I imagine it's
>> probably working for you because the swap cancel.
>>
>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>> native endian values and do swapping at the moment of writing to a PCI
>>> register using one of "store byte-reversed" instructions.
>>
>> So now you want iowrite32() on little endian and iowrite32be() on big
>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
>> should we just be using __raw_writel() on both?
>
>
> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
> swap and write separately, it is implemented via the "Store Word
> Byte-Reverse Indexed X-form" single instruction.
>
> And some archs (do not know which ones) may add memory barriers in their
> implementations of ioread/iowrite. __raw_writel is too raw :)
>
>> There doesn't actually
>> seem to be any change in behavior here, it just eliminates back-to-back
>> byte swaps, which are a nop on x86, but not power, right?
>
> Exactly. No dependency for QEMU.

How about that:
===

VFIO exposes BARs to user space as a byte stream so userspace can
read it using pread()/pwrite(). Since this is a byte stream, VFIO should
not do byte swapping and simply return values as it gets them from
PCI device.

Instead, the existing code assumes that byte stream in read/write is
little-endian and it fixes endianness for values which it passes to
ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
again. Since both byte swaps are nops on little-endian host, this works.

This also works for big endian but rather by an accident: it reads 4 bytes
from the stream (@val is big endian), converts to CPU format (which should
be big endian) as it was little endian (and @val becomes actually little
endian) and calls iowrite32() which does swapping on big endian
system again. So byte swap gets cancelled, __raw_writel() receives
a native value and then
*(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
just does the right thing.

This removes byte swaps and makes use of ioread32be/iowrite32be
(and 16bit versions) which do explicit byte swapping at the moment
of write to a PCI register. PPC64 uses a special "Store Word
Byte-Reverse Indexed X-form" instruction which does swap and store.
===

any better?

>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>> ---
>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> index 210db24..f363b5a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> @@ -21,6 +21,18 @@
>>>
>>> #include "vfio_pci_private.h"
>>>
>>> +#ifdef __BIG_ENDIAN__
>>> +#define ioread16_native ioread16be
>>> +#define ioread32_native ioread32be
>>> +#define iowrite16_native iowrite16be
>>> +#define iowrite32_native iowrite32be
>>> +#else
>>> +#define ioread16_native ioread16
>>> +#define ioread32_native ioread32
>>> +#define iowrite16_native iowrite16
>>> +#define iowrite32_native iowrite32
>>> +#endif
>>> +
>>> /*
>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>> * range which is inaccessible. The excluded range drops writes and fills
>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 4))
>>> return -EFAULT;
>>>
>>> - iowrite32(le32_to_cpu(val), io + off);
>>> + iowrite32_native(val, io + off);
>>> } else {
>>> - val = cpu_to_le32(ioread32(io + off));
>>> + val = ioread32_native(io + off);
>>>
>>> if (copy_to_user(buf, &val, 4))
>>> return -EFAULT;
>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 2))
>>> return -EFAULT;
>>>
>>> - iowrite16(le16_to_cpu(val), io + off);
>>> + iowrite16_native(val, io + off);
>>> } else {
>>> - val = cpu_to_le16(ioread16(io + off));
>>> + val = ioread16_native(io + off);
>>>
>>> if (copy_to_user(buf, &val, 2))
>>> return -EFAULT;
>>
>>
>>
>
>

--
Alexey

2014-06-19 01:51:13

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>> not do byte swapping and simply return values as it gets them from
>>> PCI device.
>>>
>>> Instead, the existing code assumes that byte stream in read/write is
>>> little-endian and it fixes endianness for values which it passes to
>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>>> little endian and le32_to_cpu/... are stubs.
>>
>> vfio read32:
>>
>> val = cpu_to_le32(ioread32(io + off));
>>
>> Where the typical x86 case, ioread32 is:
>>
>> #define ioread32(addr) readl(addr)
>>
>> and readl is:
>>
>> __le32_to_cpu(__raw_readl(addr));
>>
>> So we do canceling byte swaps, which are both nops on x86, and end up
>> returning device endian, which we assume is little endian.
>>
>> vfio write32 is similar:
>>
>> iowrite32(le32_to_cpu(val), io + off);
>>
>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>> out, so input data is device endian, which is assumed little.
>>
>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>> from the stream (@val is big endian), converts to CPU format (which should
>>> be big endian) as it was little endian (@val becomes actually little
>>> endian) and calls iowrite32() which does not do swapping on big endian
>>> system.
>>
>> Really?
>>
>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>> writel(), which seems to use the generic implementation, which does
>> include a cpu_to_le32.
>
>
> Ouch, wrong comment. iowrite32() does swapping. My bad.
>
>
>>
>> I also see other big endian archs like parisc doing cpu_to_le32 on
>> iowrite32, so I don't think this statement is true. I imagine it's
>> probably working for you because the swap cancel.
>>
>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>> native endian values and do swapping at the moment of writing to a PCI
>>> register using one of "store byte-reversed" instructions.
>>
>> So now you want iowrite32() on little endian and iowrite32be() on big
>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
>> should we just be using __raw_writel() on both?
>
>
> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
> swap and write separately, it is implemented via the "Store Word
> Byte-Reverse Indexed X-form" single instruction.
>
> And some archs (do not know which ones) may add memory barriers in their
> implementations of ioread/iowrite. __raw_writel is too raw :)
>
>> There doesn't actually
>> seem to be any change in behavior here, it just eliminates back-to-back
>> byte swaps, which are a nop on x86, but not power, right?
>
> Exactly. No dependency for QEMU.

How about that:
===

VFIO exposes BARs to user space as a byte stream so userspace can
read it using pread()/pwrite(). Since this is a byte stream, VFIO should
not do byte swapping and simply return values as it gets them from
PCI device.

Instead, the existing code assumes that byte stream in read/write is
little-endian and it fixes endianness for values which it passes to
ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
again. Since both byte swaps are nops on little-endian host, this works.

This also works for big endian but rather by an accident: it reads 4 bytes
from the stream (@val is big endian), converts to CPU format (which should
be big endian) as it was little endian (and @val becomes actually little
endian) and calls iowrite32() which does swapping on big endian
system again. So byte swap gets cancelled, __raw_writel() receives
a native value and then
*(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
just does the right thing.

This removes byte swaps and makes use of ioread32be/iowrite32be
(and 16bit versions) which do explicit byte swapping at the moment
of write to a PCI register. PPC64 uses a special "Store Word
Byte-Reverse Indexed X-form" instruction which does swap and store.
===

any better?

>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>> ---
>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> index 210db24..f363b5a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> @@ -21,6 +21,18 @@
>>>
>>> #include "vfio_pci_private.h"
>>>
>>> +#ifdef __BIG_ENDIAN__
>>> +#define ioread16_native ioread16be
>>> +#define ioread32_native ioread32be
>>> +#define iowrite16_native iowrite16be
>>> +#define iowrite32_native iowrite32be
>>> +#else
>>> +#define ioread16_native ioread16
>>> +#define ioread32_native ioread32
>>> +#define iowrite16_native iowrite16
>>> +#define iowrite32_native iowrite32
>>> +#endif
>>> +
>>> /*
>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>> * range which is inaccessible. The excluded range drops writes and fills
>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 4))
>>> return -EFAULT;
>>>
>>> - iowrite32(le32_to_cpu(val), io + off);
>>> + iowrite32_native(val, io + off);
>>> } else {
>>> - val = cpu_to_le32(ioread32(io + off));
>>> + val = ioread32_native(io + off);
>>>
>>> if (copy_to_user(buf, &val, 4))
>>> return -EFAULT;
>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 2))
>>> return -EFAULT;
>>>
>>> - iowrite16(le16_to_cpu(val), io + off);
>>> + iowrite16_native(val, io + off);
>>> } else {
>>> - val = cpu_to_le16(ioread16(io + off));
>>> + val = ioread16_native(io + off);
>>>
>>> if (copy_to_user(buf, &val, 2))
>>> return -EFAULT;
>>
>>
>>
>
>

--
Alexey

2014-06-19 03:48:12

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
>> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>>> not do byte swapping and simply return values as it gets them from
>>>> PCI device.
>>>>
>>>> Instead, the existing code assumes that byte stream in read/write is
>>>> little-endian and it fixes endianness for values which it passes to
>>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>>>> little endian and le32_to_cpu/... are stubs.
>>>
>>> vfio read32:
>>>
>>> val = cpu_to_le32(ioread32(io + off));
>>>
>>> Where the typical x86 case, ioread32 is:
>>>
>>> #define ioread32(addr) readl(addr)
>>>
>>> and readl is:
>>>
>>> __le32_to_cpu(__raw_readl(addr));
>>>
>>> So we do canceling byte swaps, which are both nops on x86, and end up
>>> returning device endian, which we assume is little endian.
>>>
>>> vfio write32 is similar:
>>>
>>> iowrite32(le32_to_cpu(val), io + off);
>>>
>>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>>> out, so input data is device endian, which is assumed little.
>>>
>>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>>> from the stream (@val is big endian), converts to CPU format (which should
>>>> be big endian) as it was little endian (@val becomes actually little
>>>> endian) and calls iowrite32() which does not do swapping on big endian
>>>> system.
>>>
>>> Really?
>>>
>>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>>> writel(), which seems to use the generic implementation, which does
>>> include a cpu_to_le32.
>>
>>
>> Ouch, wrong comment. iowrite32() does swapping. My bad.
>>
>>
>>>
>>> I also see other big endian archs like parisc doing cpu_to_le32 on
>>> iowrite32, so I don't think this statement is true. I imagine it's
>>> probably working for you because the swap cancel.
>>>
>>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>>> native endian values and do swapping at the moment of writing to a PCI
>>>> register using one of "store byte-reversed" instructions.
>>>
>>> So now you want iowrite32() on little endian and iowrite32be() on big
>>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
>>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
>>> should we just be using __raw_writel() on both?
>>
>>
>> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
>> swap and write separately, it is implemented via the "Store Word
>> Byte-Reverse Indexed X-form" single instruction.
>>
>> And some archs (do not know which ones) may add memory barriers in their
>> implementations of ioread/iowrite. __raw_writel is too raw :)
>>
>>> There doesn't actually
>>> seem to be any change in behavior here, it just eliminates back-to-back
>>> byte swaps, which are a nop on x86, but not power, right?
>>
>> Exactly. No dependency for QEMU.
>
> How about that:
> ===
>
> VFIO exposes BARs to user space as a byte stream so userspace can
> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> not do byte swapping and simply return values as it gets them from
> PCI device.
>
> Instead, the existing code assumes that byte stream in read/write is
> little-endian and it fixes endianness for values which it passes to
> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
> again. Since both byte swaps are nops on little-endian host, this works.
>
> This also works for big endian but rather by an accident: it reads 4 bytes
> from the stream (@val is big endian), converts to CPU format (which should
> be big endian) as it was little endian (and @val becomes actually little
> endian) and calls iowrite32() which does swapping on big endian
> system again. So byte swap gets cancelled, __raw_writel() receives
> a native value and then
> *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
> just does the right thing.

I am wrong here, sorry. This is what happens when you watch soccer between
2am and 4am :)

>
> This removes byte swaps and makes use of ioread32be/iowrite32be
> (and 16bit versions) which do explicit byte swapping at the moment
> of write to a PCI register. PPC64 uses a special "Store Word
> Byte-Reverse Indexed X-form" instruction which does swap and store.

No swapping is done here if we use ioread32be as it calls in_be32 and that
animal does "lwz" which is simple load from memory.

So @val (16/32 bit variable on stack) will have different values on LE and
BE but since we do not handle it the host and just memcpy it to the buffer,
nothing breaks here.

So it should be like this:
===
VFIO exposes BARs to user space as a byte stream so userspace can
read it using pread()/pwrite(). Since this is a byte stream, VFIO should
not do byte swapping and simply return values as it gets them from
PCI device and copy_to_user will save bytes in the correct
same true for writes.

Instead, the existing code assumes that byte stream in read/write is
little-endian and it fixes endianness for values which it passes to
ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
again. Since both byte swaps are nops on little-endian host, this works.

This also works for big endian but rather by an accident: it reads 4 bytes
from the stream (@val is big endian), converts to CPU format (which should
be big endian) as it was little endian (and @val becomes actually little
endian) and calls iowrite32() which does swapping on big endian
system again. So byte swap in the host gets cancelled and __raw_writel()
writes the value which was swapped originally by the guest.

This removes byte swaps and makes use of ioread32be/iowrite32be
(and 16bit versions) which do not do byte swap on BE hosts.
For LE hosts, ioread32/iowrite32 are still used.

===

> ===
>
> any better?
>
>
>
>
>>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> index 210db24..f363b5a 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> @@ -21,6 +21,18 @@
>>>>
>>>> #include "vfio_pci_private.h"
>>>>
>>>> +#ifdef __BIG_ENDIAN__
>>>> +#define ioread16_native ioread16be
>>>> +#define ioread32_native ioread32be
>>>> +#define iowrite16_native iowrite16be
>>>> +#define iowrite32_native iowrite32be
>>>> +#else
>>>> +#define ioread16_native ioread16
>>>> +#define ioread32_native ioread32
>>>> +#define iowrite16_native iowrite16
>>>> +#define iowrite32_native iowrite32
>>>> +#endif
>>>> +
>>>> /*
>>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>> * range which is inaccessible. The excluded range drops writes and fills
>>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>> if (copy_from_user(&val, buf, 4))
>>>> return -EFAULT;
>>>>
>>>> - iowrite32(le32_to_cpu(val), io + off);
>>>> + iowrite32_native(val, io + off);
>>>> } else {
>>>> - val = cpu_to_le32(ioread32(io + off));
>>>> + val = ioread32_native(io + off);
>>>>
>>>> if (copy_to_user(buf, &val, 4))
>>>> return -EFAULT;
>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>> if (copy_from_user(&val, buf, 2))
>>>> return -EFAULT;
>>>>
>>>> - iowrite16(le16_to_cpu(val), io + off);
>>>> + iowrite16_native(val, io + off);
>>>> } else {
>>>> - val = cpu_to_le16(ioread16(io + off));
>>>> + val = ioread16_native(io + off);
>>>>
>>>> if (copy_to_user(buf, &val, 2))
>>>> return -EFAULT;
>>>
>>>
>>>
>>
>>
>
>

--
Alexey

2014-06-19 05:31:04

by Bhushan Bharat

[permalink] [raw]

Subject: RE: [PATCH] vfio: Fix endianness handling for emulated BARs

> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-
> [email protected]] On Behalf Of Alexey
> Kardashevskiy
> Sent: Thursday, June 19, 2014 9:18 AM
> To: Alex Williamson
> Cc: [email protected]; Nikunj A Dadhania; [email protected];
> Alexander Graf; [email protected]
> Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs
>
> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
> > On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
> >> On 06/19/2014 04:35 AM, Alex Williamson wrote:
> >>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
> >>>> VFIO exposes BARs to user space as a byte stream so userspace can
> >>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO
> >>>> should not do byte swapping and simply return values as it gets
> >>>> them from PCI device.
> >>>>
> >>>> Instead, the existing code assumes that byte stream in read/write
> >>>> is little-endian and it fixes endianness for values which it passes
> >>>> to ioreadXX/iowriteXX helpers. This works for little-endian as PCI
> >>>> is little endian and le32_to_cpu/... are stubs.
> >>>
> >>> vfio read32:
> >>>
> >>> val = cpu_to_le32(ioread32(io + off));
> >>>
> >>> Where the typical x86 case, ioread32 is:
> >>>
> >>> #define ioread32(addr) readl(addr)
> >>>
> >>> and readl is:
> >>>
> >>> __le32_to_cpu(__raw_readl(addr));
> >>>
> >>> So we do canceling byte swaps, which are both nops on x86, and end
> >>> up returning device endian, which we assume is little endian.
> >>>
> >>> vfio write32 is similar:
> >>>
> >>> iowrite32(le32_to_cpu(val), io + off);
> >>>
> >>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
> >>> out, so input data is device endian, which is assumed little.
> >>>
> >>>> This also works for big endian but rather by an accident: it reads
> >>>> 4 bytes from the stream (@val is big endian), converts to CPU
> >>>> format (which should be big endian) as it was little endian (@val
> >>>> becomes actually little
> >>>> endian) and calls iowrite32() which does not do swapping on big
> >>>> endian system.
> >>>
> >>> Really?
> >>>
> >>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
> >>> writel(), which seems to use the generic implementation, which does
> >>> include a cpu_to_le32.
> >>
> >>
> >> Ouch, wrong comment. iowrite32() does swapping. My bad.
> >>
> >>
> >>>
> >>> I also see other big endian archs like parisc doing cpu_to_le32 on
> >>> iowrite32, so I don't think this statement is true. I imagine it's
> >>> probably working for you because the swap cancel.
> >>>
> >>>> This removes byte swapping and makes use ioread32be/iowrite32be
> >>>> (and 16bit versions) on big-endian systems. The "be" helpers take
> >>>> native endian values and do swapping at the moment of writing to a
> >>>> PCI register using one of "store byte-reversed" instructions.
> >>>
> >>> So now you want iowrite32() on little endian and iowrite32be() on
> >>> big endian, the former does a cpu_to_le32 (which is a nop on little
> >>> endian) and the latter does a cpu_to_be32 (which is a nop on big endian)...
> >>> should we just be using __raw_writel() on both?
> >>
> >>
> >> We can do that too. The beauty of iowrite32be on ppc64 is that it
> >> does not swap and write separately, it is implemented via the "Store
> >> Word Byte-Reverse Indexed X-form" single instruction.
> >>
> >> And some archs (do not know which ones) may add memory barriers in
> >> their implementations of ioread/iowrite. __raw_writel is too raw :)
> >>
> >>> There doesn't actually
> >>> seem to be any change in behavior here, it just eliminates
> >>> back-to-back byte swaps, which are a nop on x86, but not power, right?
> >>
> >> Exactly. No dependency for QEMU.
> >
> > How about that:
> > ===
> >
> > VFIO exposes BARs to user space as a byte stream so userspace can read
> > it using pread()/pwrite(). Since this is a byte stream, VFIO should
> > not do byte swapping and simply return values as it gets them from PCI
> > device.
> >
> > Instead, the existing code assumes that byte stream in read/write is
> > little-endian and it fixes endianness for values which it passes to
> > ioreadXX/iowriteXX helpers in native format. The IO helpers do
> > swapping again. Since both byte swaps are nops on little-endian host, this
> works.
> >
> > This also works for big endian but rather by an accident: it reads 4
> > bytes from the stream (@val is big endian), converts to CPU format
> > (which should be big endian) as it was little endian (and @val becomes
> > actually little
> > endian) and calls iowrite32() which does swapping on big endian system
> > again. So byte swap gets cancelled, __raw_writel() receives a native
> > value and then *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) =
> > v; just does the right thing.
>
> I am wrong here, sorry. This is what happens when you watch soccer between 2am
> and 4am :)
>
>
> >
> > This removes byte swaps and makes use of ioread32be/iowrite32be (and
> > 16bit versions) which do explicit byte swapping at the moment of write
> > to a PCI register. PPC64 uses a special "Store Word Byte-Reverse
> > Indexed X-form" instruction which does swap and store.
>
> No swapping is done here if we use ioread32be as it calls in_be32 and that
> animal does "lwz" which is simple load from memory.
>
> So @val (16/32 bit variable on stack) will have different values on LE and BE
> but since we do not handle it the host and just memcpy it to the buffer, nothing
> breaks here.
>
>
> So it should be like this:
> ===
> VFIO exposes BARs to user space as a byte stream so userspace can read it using
> pread()/pwrite(). Since this is a byte stream, VFIO should not do byte swapping
> and simply return values as it gets them from PCI device and copy_to_user will
> save bytes in the correct same true for writes.

" copy_to_user will save bytes in the correct" ---? --- "same true for writes".

Thanks
-Bharat

>
> Instead, the existing code assumes that byte stream in read/write is little-
> endian and it fixes endianness for values which it passes to ioreadXX/iowriteXX
> helpers in native format. The IO helpers do swapping again. Since both byte
> swaps are nops on little-endian host, this works.

>
> This also works for big endian but rather by an accident: it reads 4 bytes from
> the stream (@val is big endian), converts to CPU format (which should be big
> endian) as it was little endian (and @val becomes actually little
> endian) and calls iowrite32() which does swapping on big endian system again. So
> byte swap in the host gets cancelled and __raw_writel() writes the value which
> was swapped originally by the guest.
>
> This removes byte swaps and makes use of ioread32be/iowrite32be (and 16bit
> versions) which do not do byte swap on BE hosts.
> For LE hosts, ioread32/iowrite32 are still used.
>
> ===
>
>
> > ===
> >
> > any better?
> >
> >
> >
> >
> >>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
> >>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>> ---
> >>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
> >>>> 1 file changed, 16 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> b/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> index 210db24..f363b5a 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> @@ -21,6 +21,18 @@
> >>>>
> >>>> #include "vfio_pci_private.h"
> >>>>
> >>>> +#ifdef __BIG_ENDIAN__
> >>>> +#define ioread16_native ioread16be
> >>>> +#define ioread32_native ioread32be
> >>>> +#define iowrite16_native iowrite16be
> >>>> +#define iowrite32_native iowrite32be
> >>>> +#else
> >>>> +#define ioread16_native ioread16
> >>>> +#define ioread32_native ioread32
> >>>> +#define iowrite16_native iowrite16
> >>>> +#define iowrite32_native iowrite32
> >>>> +#endif
> >>>> +
> >>>> /*
> >>>> * Read or write from an __iomem region (MMIO or I/O port) with an
> excluded
> >>>> * range which is inaccessible. The excluded range drops writes
> >>>> and fills @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char
> __user *buf,
> >>>> if (copy_from_user(&val, buf, 4))
> >>>> return -EFAULT;
> >>>>
> >>>> - iowrite32(le32_to_cpu(val), io + off);
> >>>> + iowrite32_native(val, io + off);
> >>>> } else {
> >>>> - val = cpu_to_le32(ioread32(io + off));
> >>>> + val = ioread32_native(io + off);
> >>>>
> >>>> if (copy_to_user(buf, &val, 4))
> >>>> return -EFAULT;
> >>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user
> *buf,
> >>>> if (copy_from_user(&val, buf, 2))
> >>>> return -EFAULT;
> >>>>
> >>>> - iowrite16(le16_to_cpu(val), io + off);
> >>>> + iowrite16_native(val, io + off);
> >>>> } else {
> >>>> - val = cpu_to_le16(ioread16(io + off));
> >>>> + val = ioread16_native(io + off);
> >>>>
> >>>> if (copy_to_user(buf, &val, 2))
> >>>> return -EFAULT;
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
> --
> Alexey
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2014-06-19 06:17:10

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/19/2014 03:30 PM, [email protected] wrote:
>
>
>> -----Original Message-----
>> From: Linuxppc-dev [mailto:linuxppc-dev-
>> [email protected]] On Behalf Of Alexey
>> Kardashevskiy
>> Sent: Thursday, June 19, 2014 9:18 AM
>> To: Alex Williamson
>> Cc: [email protected]; Nikunj A Dadhania; [email protected];
>> Alexander Graf; [email protected]
>> Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs
>>
>> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
>>> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
>>>> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>>>>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>>>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>>>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO
>>>>>> should not do byte swapping and simply return values as it gets
>>>>>> them from PCI device.
>>>>>>
>>>>>> Instead, the existing code assumes that byte stream in read/write
>>>>>> is little-endian and it fixes endianness for values which it passes
>>>>>> to ioreadXX/iowriteXX helpers. This works for little-endian as PCI
>>>>>> is little endian and le32_to_cpu/... are stubs.
>>>>>
>>>>> vfio read32:
>>>>>
>>>>> val = cpu_to_le32(ioread32(io + off));
>>>>>
>>>>> Where the typical x86 case, ioread32 is:
>>>>>
>>>>> #define ioread32(addr) readl(addr)
>>>>>
>>>>> and readl is:
>>>>>
>>>>> __le32_to_cpu(__raw_readl(addr));
>>>>>
>>>>> So we do canceling byte swaps, which are both nops on x86, and end
>>>>> up returning device endian, which we assume is little endian.
>>>>>
>>>>> vfio write32 is similar:
>>>>>
>>>>> iowrite32(le32_to_cpu(val), io + off);
>>>>>
>>>>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>>>>> out, so input data is device endian, which is assumed little.
>>>>>
>>>>>> This also works for big endian but rather by an accident: it reads
>>>>>> 4 bytes from the stream (@val is big endian), converts to CPU
>>>>>> format (which should be big endian) as it was little endian (@val
>>>>>> becomes actually little
>>>>>> endian) and calls iowrite32() which does not do swapping on big
>>>>>> endian system.
>>>>>
>>>>> Really?
>>>>>
>>>>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>>>>> writel(), which seems to use the generic implementation, which does
>>>>> include a cpu_to_le32.
>>>>
>>>>
>>>> Ouch, wrong comment. iowrite32() does swapping. My bad.
>>>>
>>>>
>>>>>
>>>>> I also see other big endian archs like parisc doing cpu_to_le32 on
>>>>> iowrite32, so I don't think this statement is true. I imagine it's
>>>>> probably working for you because the swap cancel.
>>>>>
>>>>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>>>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>>>>> native endian values and do swapping at the moment of writing to a
>>>>>> PCI register using one of "store byte-reversed" instructions.
>>>>>
>>>>> So now you want iowrite32() on little endian and iowrite32be() on
>>>>> big endian, the former does a cpu_to_le32 (which is a nop on little
>>>>> endian) and the latter does a cpu_to_be32 (which is a nop on big endian)...
>>>>> should we just be using __raw_writel() on both?
>>>>
>>>>
>>>> We can do that too. The beauty of iowrite32be on ppc64 is that it
>>>> does not swap and write separately, it is implemented via the "Store
>>>> Word Byte-Reverse Indexed X-form" single instruction.
>>>>
>>>> And some archs (do not know which ones) may add memory barriers in
>>>> their implementations of ioread/iowrite. __raw_writel is too raw :)
>>>>
>>>>> There doesn't actually
>>>>> seem to be any change in behavior here, it just eliminates
>>>>> back-to-back byte swaps, which are a nop on x86, but not power, right?
>>>>
>>>> Exactly. No dependency for QEMU.
>>>
>>> How about that:
>>> ===
>>>
>>> VFIO exposes BARs to user space as a byte stream so userspace can read
>>> it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>> not do byte swapping and simply return values as it gets them from PCI
>>> device.
>>>
>>> Instead, the existing code assumes that byte stream in read/write is
>>> little-endian and it fixes endianness for values which it passes to
>>> ioreadXX/iowriteXX helpers in native format. The IO helpers do
>>> swapping again. Since both byte swaps are nops on little-endian host, this
>> works.
>>>
>>> This also works for big endian but rather by an accident: it reads 4
>>> bytes from the stream (@val is big endian), converts to CPU format
>>> (which should be big endian) as it was little endian (and @val becomes
>>> actually little
>>> endian) and calls iowrite32() which does swapping on big endian system
>>> again. So byte swap gets cancelled, __raw_writel() receives a native
>>> value and then *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) =
>>> v; just does the right thing.
>>
>> I am wrong here, sorry. This is what happens when you watch soccer between 2am
>> and 4am :)
>>
>>
>>>
>>> This removes byte swaps and makes use of ioread32be/iowrite32be (and
>>> 16bit versions) which do explicit byte swapping at the moment of write
>>> to a PCI register. PPC64 uses a special "Store Word Byte-Reverse
>>> Indexed X-form" instruction which does swap and store.
>>
>> No swapping is done here if we use ioread32be as it calls in_be32 and that
>> animal does "lwz" which is simple load from memory.
>>
>> So @val (16/32 bit variable on stack) will have different values on LE and BE
>> but since we do not handle it the host and just memcpy it to the buffer, nothing
>> breaks here.
>>
>>
>> So it should be like this:
>> ===
>> VFIO exposes BARs to user space as a byte stream so userspace can read it using
>> pread()/pwrite(). Since this is a byte stream, VFIO should not do byte swapping
>> and simply return values as it gets them from PCI device and copy_to_user will
>> save bytes in the correct same true for writes.
>
> " copy_to_user will save bytes in the correct" ---? --- "same true for writes".

Oops. "correct order" is it.

>
> Thanks
> -Bharat
>
>>
>> Instead, the existing code assumes that byte stream in read/write is little-
>> endian and it fixes endianness for values which it passes to ioreadXX/iowriteXX
>> helpers in native format. The IO helpers do swapping again. Since both byte
>> swaps are nops on little-endian host, this works.
>
>>
>> This also works for big endian but rather by an accident: it reads 4 bytes from
>> the stream (@val is big endian), converts to CPU format (which should be big
>> endian) as it was little endian (and @val becomes actually little
>> endian) and calls iowrite32() which does swapping on big endian system again. So
>> byte swap in the host gets cancelled and __raw_writel() writes the value which
>> was swapped originally by the guest.
>>
>> This removes byte swaps and makes use of ioread32be/iowrite32be (and 16bit
>> versions) which do not do byte swap on BE hosts.
>> For LE hosts, ioread32/iowrite32 are still used.
>>
>> ===
>>
>>
>>> ===
>>>
>>> any better?
>>>
>>>
>>>
>>>
>>>>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> index 210db24..f363b5a 100644
>>>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> @@ -21,6 +21,18 @@
>>>>>>
>>>>>> #include "vfio_pci_private.h"
>>>>>>
>>>>>> +#ifdef __BIG_ENDIAN__
>>>>>> +#define ioread16_native ioread16be
>>>>>> +#define ioread32_native ioread32be
>>>>>> +#define iowrite16_native iowrite16be
>>>>>> +#define iowrite32_native iowrite32be
>>>>>> +#else
>>>>>> +#define ioread16_native ioread16
>>>>>> +#define ioread32_native ioread32
>>>>>> +#define iowrite16_native iowrite16
>>>>>> +#define iowrite32_native iowrite32
>>>>>> +#endif
>>>>>> +
>>>>>> /*
>>>>>> * Read or write from an __iomem region (MMIO or I/O port) with an
>> excluded
>>>>>> * range which is inaccessible. The excluded range drops writes
>>>>>> and fills @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char
>> __user *buf,
>>>>>> if (copy_from_user(&val, buf, 4))
>>>>>> return -EFAULT;
>>>>>>
>>>>>> - iowrite32(le32_to_cpu(val), io + off);
>>>>>> + iowrite32_native(val, io + off);
>>>>>> } else {
>>>>>> - val = cpu_to_le32(ioread32(io + off));
>>>>>> + val = ioread32_native(io + off);
>>>>>>
>>>>>> if (copy_to_user(buf, &val, 4))
>>>>>> return -EFAULT;
>>>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user
>> *buf,
>>>>>> if (copy_from_user(&val, buf, 2))
>>>>>> return -EFAULT;
>>>>>>
>>>>>> - iowrite16(le16_to_cpu(val), io + off);
>>>>>> + iowrite16_native(val, io + off);
>>>>>> } else {
>>>>>> - val = cpu_to_le16(ioread16(io + off));
>>>>>> + val = ioread16_native(io + off);
>>>>>>
>>>>>> if (copy_to_user(buf, &val, 2))
>>>>>> return -EFAULT;
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Alexey
>> _______________________________________________
>> Linuxppc-dev mailing list
>> [email protected]
>> https://lists.ozlabs.org/listinfo/linuxppc-dev

--
Alexey

2014-06-20 03:22:12

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Thu, 2014-06-19 at 13:48 +1000, Alexey Kardashevskiy wrote:
> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
> > On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
> >> On 06/19/2014 04:35 AM, Alex Williamson wrote:
> >>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
> >>>> VFIO exposes BARs to user space as a byte stream so userspace can
> >>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> >>>> not do byte swapping and simply return values as it gets them from
> >>>> PCI device.
> >>>>
> >>>> Instead, the existing code assumes that byte stream in read/write is
> >>>> little-endian and it fixes endianness for values which it passes to
> >>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
> >>>> little endian and le32_to_cpu/... are stubs.
> >>>
> >>> vfio read32:
> >>>
> >>> val = cpu_to_le32(ioread32(io + off));
> >>>
> >>> Where the typical x86 case, ioread32 is:
> >>>
> >>> #define ioread32(addr) readl(addr)
> >>>
> >>> and readl is:
> >>>
> >>> __le32_to_cpu(__raw_readl(addr));
> >>>
> >>> So we do canceling byte swaps, which are both nops on x86, and end up
> >>> returning device endian, which we assume is little endian.
> >>>
> >>> vfio write32 is similar:
> >>>
> >>> iowrite32(le32_to_cpu(val), io + off);
> >>>
> >>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
> >>> out, so input data is device endian, which is assumed little.
> >>>
> >>>> This also works for big endian but rather by an accident: it reads 4 bytes
> >>>> from the stream (@val is big endian), converts to CPU format (which should
> >>>> be big endian) as it was little endian (@val becomes actually little
> >>>> endian) and calls iowrite32() which does not do swapping on big endian
> >>>> system.
> >>>
> >>> Really?
> >>>
> >>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
> >>> writel(), which seems to use the generic implementation, which does
> >>> include a cpu_to_le32.
> >>
> >>
> >> Ouch, wrong comment. iowrite32() does swapping. My bad.
> >>
> >>
> >>>
> >>> I also see other big endian archs like parisc doing cpu_to_le32 on
> >>> iowrite32, so I don't think this statement is true. I imagine it's
> >>> probably working for you because the swap cancel.
> >>>
> >>>> This removes byte swapping and makes use ioread32be/iowrite32be
> >>>> (and 16bit versions) on big-endian systems. The "be" helpers take
> >>>> native endian values and do swapping at the moment of writing to a PCI
> >>>> register using one of "store byte-reversed" instructions.
> >>>
> >>> So now you want iowrite32() on little endian and iowrite32be() on big
> >>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
> >>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
> >>> should we just be using __raw_writel() on both?
> >>
> >>
> >> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
> >> swap and write separately, it is implemented via the "Store Word
> >> Byte-Reverse Indexed X-form" single instruction.
> >>
> >> And some archs (do not know which ones) may add memory barriers in their
> >> implementations of ioread/iowrite. __raw_writel is too raw :)
> >>
> >>> There doesn't actually
> >>> seem to be any change in behavior here, it just eliminates back-to-back
> >>> byte swaps, which are a nop on x86, but not power, right?
> >>
> >> Exactly. No dependency for QEMU.
> >
> > How about that:
> > ===
> >
> > VFIO exposes BARs to user space as a byte stream so userspace can
> > read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> > not do byte swapping and simply return values as it gets them from
> > PCI device.
> >
> > Instead, the existing code assumes that byte stream in read/write is
> > little-endian and it fixes endianness for values which it passes to
> > ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
> > again. Since both byte swaps are nops on little-endian host, this works.
> >
> > This also works for big endian but rather by an accident: it reads 4 bytes
> > from the stream (@val is big endian), converts to CPU format (which should
> > be big endian) as it was little endian (and @val becomes actually little
> > endian) and calls iowrite32() which does swapping on big endian
> > system again. So byte swap gets cancelled, __raw_writel() receives
> > a native value and then
> > *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
> > just does the right thing.
>
> I am wrong here, sorry. This is what happens when you watch soccer between
> 2am and 4am :)
>
>
> >
> > This removes byte swaps and makes use of ioread32be/iowrite32be
> > (and 16bit versions) which do explicit byte swapping at the moment
> > of write to a PCI register. PPC64 uses a special "Store Word
> > Byte-Reverse Indexed X-form" instruction which does swap and store.
>
> No swapping is done here if we use ioread32be as it calls in_be32 and that
> animal does "lwz" which is simple load from memory.
>
> So @val (16/32 bit variable on stack) will have different values on LE and
> BE but since we do not handle it the host and just memcpy it to the buffer,
> nothing breaks here.
>
>
> So it should be like this:
> ===
> VFIO exposes BARs to user space as a byte stream so userspace can
> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> not do byte swapping and simply return values as it gets them from
> PCI device and copy_to_user will save bytes in the correct
> same true for writes.
>
> Instead, the existing code assumes that byte stream in read/write is
> little-endian and it fixes endianness for values which it passes to
> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
> again. Since both byte swaps are nops on little-endian host, this works.
>
> This also works for big endian but rather by an accident: it reads 4 bytes
> from the stream (@val is big endian), converts to CPU format (which should
> be big endian) as it was little endian (and @val becomes actually little
> endian) and calls iowrite32() which does swapping on big endian
> system again. So byte swap in the host gets cancelled and __raw_writel()
> writes the value which was swapped originally by the guest.
>
> This removes byte swaps and makes use of ioread32be/iowrite32be
> (and 16bit versions) which do not do byte swap on BE hosts.
> For LE hosts, ioread32/iowrite32 are still used.
>
> ===

Working on big endian being an accident may be a matter of perspective.
The comment remains that this patch doesn't actually fix anything except
the overhead on big endian systems doing redundant byte swapping and
maybe the philosophy that vfio regions are little endian.

I'm still not a fan of iowrite vs iowritebe, there must be something we
can use that doesn't have an implicit swap. Calling it iowrite*_native
is also an abuse of the namespace. Next thing we know some common code
will legitimately use that name. If we do need to define an alias
(which I'd like to avoid) it should be something like vfio_iowrite32.
Thanks,

Alex

> > ===
> >
> > any better?
> >
> >
> >
> >
> >>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
> >>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>> ---
> >>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
> >>>> 1 file changed, 16 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> index 210db24..f363b5a 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> >>>> @@ -21,6 +21,18 @@
> >>>>
> >>>> #include "vfio_pci_private.h"
> >>>>
> >>>> +#ifdef __BIG_ENDIAN__
> >>>> +#define ioread16_native ioread16be
> >>>> +#define ioread32_native ioread32be
> >>>> +#define iowrite16_native iowrite16be
> >>>> +#define iowrite32_native iowrite32be
> >>>> +#else
> >>>> +#define ioread16_native ioread16
> >>>> +#define ioread32_native ioread32
> >>>> +#define iowrite16_native iowrite16
> >>>> +#define iowrite32_native iowrite32
> >>>> +#endif
> >>>> +
> >>>> /*
> >>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
> >>>> * range which is inaccessible. The excluded range drops writes and fills
> >>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> >>>> if (copy_from_user(&val, buf, 4))
> >>>> return -EFAULT;
> >>>>
> >>>> - iowrite32(le32_to_cpu(val), io + off);
> >>>> + iowrite32_native(val, io + off);
> >>>> } else {
> >>>> - val = cpu_to_le32(ioread32(io + off));
> >>>> + val = ioread32_native(io + off);
> >>>>
> >>>> if (copy_to_user(buf, &val, 4))
> >>>> return -EFAULT;
> >>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> >>>> if (copy_from_user(&val, buf, 2))
> >>>> return -EFAULT;
> >>>>
> >>>> - iowrite16(le16_to_cpu(val), io + off);
> >>>> + iowrite16_native(val, io + off);
> >>>> } else {
> >>>> - val = cpu_to_le16(ioread16(io + off));
> >>>> + val = ioread16_native(io + off);
> >>>>
> >>>> if (copy_to_user(buf, &val, 2))
> >>>> return -EFAULT;
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>

2014-06-20 14:14:57

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/20/2014 01:21 PM, Alex Williamson wrote:
> On Thu, 2014-06-19 at 13:48 +1000, Alexey Kardashevskiy wrote:
>> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
>>> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
>>>> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>>>>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>>>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>>>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>>>>> not do byte swapping and simply return values as it gets them from
>>>>>> PCI device.
>>>>>>
>>>>>> Instead, the existing code assumes that byte stream in read/write is
>>>>>> little-endian and it fixes endianness for values which it passes to
>>>>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>>>>>> little endian and le32_to_cpu/... are stubs.
>>>>>
>>>>> vfio read32:
>>>>>
>>>>> val = cpu_to_le32(ioread32(io + off));
>>>>>
>>>>> Where the typical x86 case, ioread32 is:
>>>>>
>>>>> #define ioread32(addr) readl(addr)
>>>>>
>>>>> and readl is:
>>>>>
>>>>> __le32_to_cpu(__raw_readl(addr));
>>>>>
>>>>> So we do canceling byte swaps, which are both nops on x86, and end up
>>>>> returning device endian, which we assume is little endian.
>>>>>
>>>>> vfio write32 is similar:
>>>>>
>>>>> iowrite32(le32_to_cpu(val), io + off);
>>>>>
>>>>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>>>>> out, so input data is device endian, which is assumed little.
>>>>>
>>>>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>>>>> from the stream (@val is big endian), converts to CPU format (which should
>>>>>> be big endian) as it was little endian (@val becomes actually little
>>>>>> endian) and calls iowrite32() which does not do swapping on big endian
>>>>>> system.
>>>>>
>>>>> Really?
>>>>>
>>>>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>>>>> writel(), which seems to use the generic implementation, which does
>>>>> include a cpu_to_le32.
>>>>
>>>>
>>>> Ouch, wrong comment. iowrite32() does swapping. My bad.
>>>>
>>>>
>>>>>
>>>>> I also see other big endian archs like parisc doing cpu_to_le32 on
>>>>> iowrite32, so I don't think this statement is true. I imagine it's
>>>>> probably working for you because the swap cancel.
>>>>>
>>>>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>>>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>>>>> native endian values and do swapping at the moment of writing to a PCI
>>>>>> register using one of "store byte-reversed" instructions.
>>>>>
>>>>> So now you want iowrite32() on little endian and iowrite32be() on big
>>>>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
>>>>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
>>>>> should we just be using __raw_writel() on both?
>>>>
>>>>
>>>> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
>>>> swap and write separately, it is implemented via the "Store Word
>>>> Byte-Reverse Indexed X-form" single instruction.
>>>>
>>>> And some archs (do not know which ones) may add memory barriers in their
>>>> implementations of ioread/iowrite. __raw_writel is too raw :)
>>>>
>>>>> There doesn't actually
>>>>> seem to be any change in behavior here, it just eliminates back-to-back
>>>>> byte swaps, which are a nop on x86, but not power, right?
>>>>
>>>> Exactly. No dependency for QEMU.
>>>
>>> How about that:
>>> ===
>>>
>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>> not do byte swapping and simply return values as it gets them from
>>> PCI device.
>>>
>>> Instead, the existing code assumes that byte stream in read/write is
>>> little-endian and it fixes endianness for values which it passes to
>>> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
>>> again. Since both byte swaps are nops on little-endian host, this works.
>>>
>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>> from the stream (@val is big endian), converts to CPU format (which should
>>> be big endian) as it was little endian (and @val becomes actually little
>>> endian) and calls iowrite32() which does swapping on big endian
>>> system again. So byte swap gets cancelled, __raw_writel() receives
>>> a native value and then
>>> *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
>>> just does the right thing.
>>
>> I am wrong here, sorry. This is what happens when you watch soccer between
>> 2am and 4am :)
>>
>>
>>>
>>> This removes byte swaps and makes use of ioread32be/iowrite32be
>>> (and 16bit versions) which do explicit byte swapping at the moment
>>> of write to a PCI register. PPC64 uses a special "Store Word
>>> Byte-Reverse Indexed X-form" instruction which does swap and store.
>>
>> No swapping is done here if we use ioread32be as it calls in_be32 and that
>> animal does "lwz" which is simple load from memory.
>>
>> So @val (16/32 bit variable on stack) will have different values on LE and
>> BE but since we do not handle it the host and just memcpy it to the buffer,
>> nothing breaks here.
>>
>>
>> So it should be like this:
>> ===
>> VFIO exposes BARs to user space as a byte stream so userspace can
>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>> not do byte swapping and simply return values as it gets them from
>> PCI device and copy_to_user will save bytes in the correct
>> same true for writes.
>>
>> Instead, the existing code assumes that byte stream in read/write is
>> little-endian and it fixes endianness for values which it passes to
>> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
>> again. Since both byte swaps are nops on little-endian host, this works.
>>
>> This also works for big endian but rather by an accident: it reads 4 bytes
>> from the stream (@val is big endian), converts to CPU format (which should
>> be big endian) as it was little endian (and @val becomes actually little
>> endian) and calls iowrite32() which does swapping on big endian
>> system again. So byte swap in the host gets cancelled and __raw_writel()
>> writes the value which was swapped originally by the guest.
>>
>> This removes byte swaps and makes use of ioread32be/iowrite32be
>> (and 16bit versions) which do not do byte swap on BE hosts.
>> For LE hosts, ioread32/iowrite32 are still used.
>>
>> ===
>
> Working on big endian being an accident may be a matter of perspective.
> The comment remains that this patch doesn't actually fix anything except
> the overhead on big endian systems doing redundant byte swapping and
> maybe the philosophy that vfio regions are little endian.

They are little-endian only between ioread32() and copy_to_user() calls,
besides that it is a bytes stream which does not have endianness so I do
not understand the comment about philosophy...

> I'm still not a fan of iowrite vs iowritebe, there must be something we
> can use that doesn't have an implicit swap. Calling it iowrite*_native
> is also an abuse of the namespace. Next thing we know some common code
> will legitimately use that name. If we do need to define an alias
> (which I'd like to avoid) it should be something like vfio_iowrite32.

We can still use __raw_writel&co, would that be ok?

> Thanks,
>
> Alex
>
>>> ===
>>>
>>> any better?
>>>
>>>
>>>
>>>
>>>>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> index 210db24..f363b5a 100644
>>>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>> @@ -21,6 +21,18 @@
>>>>>>
>>>>>> #include "vfio_pci_private.h"
>>>>>>
>>>>>> +#ifdef __BIG_ENDIAN__
>>>>>> +#define ioread16_native ioread16be
>>>>>> +#define ioread32_native ioread32be
>>>>>> +#define iowrite16_native iowrite16be
>>>>>> +#define iowrite32_native iowrite32be
>>>>>> +#else
>>>>>> +#define ioread16_native ioread16
>>>>>> +#define ioread32_native ioread32
>>>>>> +#define iowrite16_native iowrite16
>>>>>> +#define iowrite32_native iowrite32
>>>>>> +#endif
>>>>>> +
>>>>>> /*
>>>>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>>>> * range which is inaccessible. The excluded range drops writes and fills
>>>>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>>>> if (copy_from_user(&val, buf, 4))
>>>>>> return -EFAULT;
>>>>>>
>>>>>> - iowrite32(le32_to_cpu(val), io + off);
>>>>>> + iowrite32_native(val, io + off);
>>>>>> } else {
>>>>>> - val = cpu_to_le32(ioread32(io + off));
>>>>>> + val = ioread32_native(io + off);
>>>>>>
>>>>>> if (copy_to_user(buf, &val, 4))
>>>>>> return -EFAULT;
>>>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>>>> if (copy_from_user(&val, buf, 2))
>>>>>> return -EFAULT;
>>>>>>
>>>>>> - iowrite16(le16_to_cpu(val), io + off);
>>>>>> + iowrite16_native(val, io + off);
>>>>>> } else {
>>>>>> - val = cpu_to_le16(ioread16(io + off));
>>>>>> + val = ioread16_native(io + off);
>>>>>>
>>>>>> if (copy_to_user(buf, &val, 2))
>>>>>> return -EFAULT;
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>

--
Alexey

2014-06-20 23:13:17

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:

> Working on big endian being an accident may be a matter of perspective

:-)

> The comment remains that this patch doesn't actually fix anything except
> the overhead on big endian systems doing redundant byte swapping and
> maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver which is the only one to know
whether a given BAR location is a a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.

> I'm still not a fan of iowrite vs iowritebe, there must be something we
> can use that doesn't have an implicit swap.

Sadly there isn't ... In the old day we didn't even have the "be"
variant and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel but here the semantics are much more than
just "don't swap", they also don't have memory barriers (which means
they are essentially useless to most drivers unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer which we know never have side
effects).

> Calling it iowrite*_native is also an abuse of the namespace.

> Next thing we know some common code
> will legitimately use that name.

I might make sense to those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIOs iirc (though don't ask me for examples, I can't really
remember).

> If we do need to define an alias
> (which I'd like to avoid) it should be something like vfio_iowrite32.
> Thanks,

Cheers,
Ben.

> Alex
>
> > > ===
> > >
> > > any better?
> > >
> > >
> > >
> > >
> > >>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
> > >>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> > >>>> ---
> > >>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
> > >>>> 1 file changed, 16 insertions(+), 4 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> index 210db24..f363b5a 100644
> > >>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> @@ -21,6 +21,18 @@
> > >>>>
> > >>>> #include "vfio_pci_private.h"
> > >>>>
> > >>>> +#ifdef __BIG_ENDIAN__
> > >>>> +#define ioread16_native ioread16be
> > >>>> +#define ioread32_native ioread32be
> > >>>> +#define iowrite16_native iowrite16be
> > >>>> +#define iowrite32_native iowrite32be
> > >>>> +#else
> > >>>> +#define ioread16_native ioread16
> > >>>> +#define ioread32_native ioread32
> > >>>> +#define iowrite16_native iowrite16
> > >>>> +#define iowrite32_native iowrite32
> > >>>> +#endif
> > >>>> +
> > >>>> /*
> > >>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
> > >>>> * range which is inaccessible. The excluded range drops writes and fills
> > >>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> > >>>> if (copy_from_user(&val, buf, 4))
> > >>>> return -EFAULT;
> > >>>>
> > >>>> - iowrite32(le32_to_cpu(val), io + off);
> > >>>> + iowrite32_native(val, io + off);
> > >>>> } else {
> > >>>> - val = cpu_to_le32(ioread32(io + off));
> > >>>> + val = ioread32_native(io + off);
> > >>>>
> > >>>> if (copy_to_user(buf, &val, 4))
> > >>>> return -EFAULT;
> > >>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> > >>>> if (copy_from_user(&val, buf, 2))
> > >>>> return -EFAULT;
> > >>>>
> > >>>> - iowrite16(le16_to_cpu(val), io + off);
> > >>>> + iowrite16_native(val, io + off);
> > >>>> } else {
> > >>>> - val = cpu_to_le16(ioread16(io + off));
> > >>>> + val = ioread16_native(io + off);
> > >>>>
> > >>>> if (copy_to_user(buf, &val, 2))
> > >>>> return -EFAULT;
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-06-20 23:16:55

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Sat, 2014-06-21 at 00:14 +1000, Alexey Kardashevskiy wrote:

> We can still use __raw_writel&co, would that be ok?

No unless you understand precisely what kind of memory barriers each
platform require for these.

Cheers,
Ben.

2014-06-24 10:11:20

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>
>> Working on big endian being an accident may be a matter of perspective
>
> :-)
>
>> The comment remains that this patch doesn't actually fix anything except
>> the overhead on big endian systems doing redundant byte swapping and
>> maybe the philosophy that vfio regions are little endian.
>
> Yes, that works by accident because technically VFIO is a transport and
> thus shouldn't perform any endian swapping of any sort, which remains
> the responsibility of the end driver which is the only one to know
> whether a given BAR location is a a register or some streaming data
> and in the former case whether it's LE or BE (some PCI devices are BE
> even ! :-)
>
> But yes, in the end, it works with the dual "cancelling" swaps and the
> overhead of those swaps is probably drowned in the noise of the syscall
> overhead.
>
>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>> can use that doesn't have an implicit swap.
>
> Sadly there isn't ... In the old day we didn't even have the "be"
> variant and readl/writel style accessors still don't have them either
> for all archs.
>
> There is __raw_readl/writel but here the semantics are much more than
> just "don't swap", they also don't have memory barriers (which means
> they are essentially useless to most drivers unless those are platform
> specific drivers which know exactly what they are doing, or in the rare
> cases such as accessing a framebuffer which we know never have side
> effects).
>
>> Calling it iowrite*_native is also an abuse of the namespace.
>
>
>> Next thing we know some common code
>> will legitimately use that name.
>
> I might make sense to those definitions into a common header. There have
> been a handful of cases in the past that wanted that sort of "native
> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> remember).
>
>> If we do need to define an alias
>> (which I'd like to avoid) it should be something like vfio_iowrite32.

Ping?

We need to make a decision whether to move those xxx_native() helpers
somewhere (where?) or leave the patch as is (as we figured out that
iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix commit log to explain everything.

Thanks!

>> Thanks,
>
> Cheers,
> Ben.
>
>> Alex
>>
>>>> ===
>>>>
>>>> any better?
>>>>
>>>>
>>>>
>>>>
>>>>>>> Suggested-by: Benjamin Herrenschmidt <[email protected]>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>>> ---
>>>>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>>>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>>> index 210db24..f363b5a 100644
>>>>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>>>>> @@ -21,6 +21,18 @@
>>>>>>>
>>>>>>> #include "vfio_pci_private.h"
>>>>>>>
>>>>>>> +#ifdef __BIG_ENDIAN__
>>>>>>> +#define ioread16_native ioread16be
>>>>>>> +#define ioread32_native ioread32be
>>>>>>> +#define iowrite16_native iowrite16be
>>>>>>> +#define iowrite32_native iowrite32be
>>>>>>> +#else
>>>>>>> +#define ioread16_native ioread16
>>>>>>> +#define ioread32_native ioread32
>>>>>>> +#define iowrite16_native iowrite16
>>>>>>> +#define iowrite32_native iowrite32
>>>>>>> +#endif
>>>>>>> +
>>>>>>> /*
>>>>>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>>>>> * range which is inaccessible. The excluded range drops writes and fills
>>>>>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>>>>> if (copy_from_user(&val, buf, 4))
>>>>>>> return -EFAULT;
>>>>>>>
>>>>>>> - iowrite32(le32_to_cpu(val), io + off);
>>>>>>> + iowrite32_native(val, io + off);
>>>>>>> } else {
>>>>>>> - val = cpu_to_le32(ioread32(io + off));
>>>>>>> + val = ioread32_native(io + off);
>>>>>>>
>>>>>>> if (copy_to_user(buf, &val, 4))
>>>>>>> return -EFAULT;
>>>>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>>>>> if (copy_from_user(&val, buf, 2))
>>>>>>> return -EFAULT;
>>>>>>>
>>>>>>> - iowrite16(le16_to_cpu(val), io + off);
>>>>>>> + iowrite16_native(val, io + off);
>>>>>>> } else {
>>>>>>> - val = cpu_to_le16(ioread16(io + off));
>>>>>>> + val = ioread16_native(io + off);
>>>>>>>
>>>>>>> if (copy_to_user(buf, &val, 2))
>>>>>>> return -EFAULT;
>>>>>>
>>>>>>
>>>>>>

--
Alexey

2014-06-24 10:42:01

by Alexander Graf

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>
>>> Working on big endian being an accident may be a matter of perspective
>> :-)
>>
>>> The comment remains that this patch doesn't actually fix anything except
>>> the overhead on big endian systems doing redundant byte swapping and
>>> maybe the philosophy that vfio regions are little endian.
>> Yes, that works by accident because technically VFIO is a transport and
>> thus shouldn't perform any endian swapping of any sort, which remains
>> the responsibility of the end driver which is the only one to know
>> whether a given BAR location is a a register or some streaming data
>> and in the former case whether it's LE or BE (some PCI devices are BE
>> even ! :-)
>>
>> But yes, in the end, it works with the dual "cancelling" swaps and the
>> overhead of those swaps is probably drowned in the noise of the syscall
>> overhead.
>>
>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>> can use that doesn't have an implicit swap.
>> Sadly there isn't ... In the old day we didn't even have the "be"
>> variant and readl/writel style accessors still don't have them either
>> for all archs.
>>
>> There is __raw_readl/writel but here the semantics are much more than
>> just "don't swap", they also don't have memory barriers (which means
>> they are essentially useless to most drivers unless those are platform
>> specific drivers which know exactly what they are doing, or in the rare
>> cases such as accessing a framebuffer which we know never have side
>> effects).
>>
>>> Calling it iowrite*_native is also an abuse of the namespace.
>>
>>> Next thing we know some common code
>>> will legitimately use that name.
>> I might make sense to those definitions into a common header. There have
>> been a handful of cases in the past that wanted that sort of "native
>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>> remember).
>>
>>> If we do need to define an alias
>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>
> Ping?
>
> We need to make a decision whether to move those xxx_native() helpers
> somewhere (where?) or leave the patch as is (as we figured out that
> iowriteXX functions implement barriers and we cannot just use raw
> accessors) and fix commit log to explain everything.

Is there actually any difference in generated code with this patch
applied and without? I would hope that iowrite..() is inlined and
cancels out the cpu_to_le..() calls that are also inlined?

Alex

2014-06-24 12:50:46

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/24/2014 08:41 PM, Alexander Graf wrote:
>
> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>
>>>> Working on big endian being an accident may be a matter of perspective
>>> :-)
>>>
>>>> The comment remains that this patch doesn't actually fix anything except
>>>> the overhead on big endian systems doing redundant byte swapping and
>>>> maybe the philosophy that vfio regions are little endian.
>>> Yes, that works by accident because technically VFIO is a transport and
>>> thus shouldn't perform any endian swapping of any sort, which remains
>>> the responsibility of the end driver which is the only one to know
>>> whether a given BAR location is a a register or some streaming data
>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>> even ! :-)
>>>
>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>> overhead of those swaps is probably drowned in the noise of the syscall
>>> overhead.
>>>
>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>> can use that doesn't have an implicit swap.
>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>> variant and readl/writel style accessors still don't have them either
>>> for all archs.
>>>
>>> There is __raw_readl/writel but here the semantics are much more than
>>> just "don't swap", they also don't have memory barriers (which means
>>> they are essentially useless to most drivers unless those are platform
>>> specific drivers which know exactly what they are doing, or in the rare
>>> cases such as accessing a framebuffer which we know never have side
>>> effects).
>>>
>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>
>>>> Next thing we know some common code
>>>> will legitimately use that name.
>>> I might make sense to those definitions into a common header. There have
>>> been a handful of cases in the past that wanted that sort of "native
>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>> remember).
>>>
>>>> If we do need to define an alias
>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>
>> Ping?
>>
>> We need to make a decision whether to move those xxx_native() helpers
>> somewhere (where?) or leave the patch as is (as we figured out that
>> iowriteXX functions implement barriers and we cannot just use raw
>> accessors) and fix commit log to explain everything.
>
> Is there actually any difference in generated code with this patch applied
> and without? I would hope that iowrite..() is inlined and cancels out the
> cpu_to_le..() calls that are also inlined?

iowrite32 is a non-inline function so conversions take place so are the
others. And sorry but I fail to see why this matters. We are not trying to
accelerate things, we are removing redundant operations which confuse
people who read the code.

--
Alexey

2014-06-24 12:52:28

by Alexander Graf

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>>
>>>>> Working on big endian being an accident may be a matter of perspective
>>>> :-)
>>>>
>>>>> The comment remains that this patch doesn't actually fix anything except
>>>>> the overhead on big endian systems doing redundant byte swapping and
>>>>> maybe the philosophy that vfio regions are little endian.
>>>> Yes, that works by accident because technically VFIO is a transport and
>>>> thus shouldn't perform any endian swapping of any sort, which remains
>>>> the responsibility of the end driver which is the only one to know
>>>> whether a given BAR location is a a register or some streaming data
>>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>>> even ! :-)
>>>>
>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>>> overhead of those swaps is probably drowned in the noise of the syscall
>>>> overhead.
>>>>
>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>>> can use that doesn't have an implicit swap.
>>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>>> variant and readl/writel style accessors still don't have them either
>>>> for all archs.
>>>>
>>>> There is __raw_readl/writel but here the semantics are much more than
>>>> just "don't swap", they also don't have memory barriers (which means
>>>> they are essentially useless to most drivers unless those are platform
>>>> specific drivers which know exactly what they are doing, or in the rare
>>>> cases such as accessing a framebuffer which we know never have side
>>>> effects).
>>>>
>>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>>> Next thing we know some common code
>>>>> will legitimately use that name.
>>>> I might make sense to those definitions into a common header. There have
>>>> been a handful of cases in the past that wanted that sort of "native
>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>>> remember).
>>>>
>>>>> If we do need to define an alias
>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>> Ping?
>>>
>>> We need to make a decision whether to move those xxx_native() helpers
>>> somewhere (where?) or leave the patch as is (as we figured out that
>>> iowriteXX functions implement barriers and we cannot just use raw
>>> accessors) and fix commit log to explain everything.
>> Is there actually any difference in generated code with this patch applied
>> and without? I would hope that iowrite..() is inlined and cancels out the
>> cpu_to_le..() calls that are also inlined?
> iowrite32 is a non-inline function so conversions take place so are the
> others. And sorry but I fail to see why this matters. We are not trying to
> accelerate things, we are removing redundant operations which confuse
> people who read the code.

The confusion depends on where you're coming from. If you happen to know
that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

I don't have a strong feeling either way though and will let Alex decide
on the path forward :).

Alex

2014-06-24 13:01:52

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/24/2014 10:52 PM, Alexander Graf wrote:
>
> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>>>
>>>>>> Working on big endian being an accident may be a matter of perspective
>>>>> :-)
>>>>>
>>>>>> The comment remains that this patch doesn't actually fix anything except
>>>>>> the overhead on big endian systems doing redundant byte swapping and
>>>>>> maybe the philosophy that vfio regions are little endian.
>>>>> Yes, that works by accident because technically VFIO is a transport and
>>>>> thus shouldn't perform any endian swapping of any sort, which remains
>>>>> the responsibility of the end driver which is the only one to know
>>>>> whether a given BAR location is a a register or some streaming data
>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>>>> even ! :-)
>>>>>
>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>>>> overhead of those swaps is probably drowned in the noise of the syscall
>>>>> overhead.
>>>>>
>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>>>> can use that doesn't have an implicit swap.
>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>>>> variant and readl/writel style accessors still don't have them either
>>>>> for all archs.
>>>>>
>>>>> There is __raw_readl/writel but here the semantics are much more than
>>>>> just "don't swap", they also don't have memory barriers (which means
>>>>> they are essentially useless to most drivers unless those are platform
>>>>> specific drivers which know exactly what they are doing, or in the rare
>>>>> cases such as accessing a framebuffer which we know never have side
>>>>> effects).
>>>>>
>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>>>> Next thing we know some common code
>>>>>> will legitimately use that name.
>>>>> I might make sense to those definitions into a common header. There have
>>>>> been a handful of cases in the past that wanted that sort of "native
>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>>>> remember).
>>>>>
>>>>>> If we do need to define an alias
>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>>> Ping?
>>>>
>>>> We need to make a decision whether to move those xxx_native() helpers
>>>> somewhere (where?) or leave the patch as is (as we figured out that
>>>> iowriteXX functions implement barriers and we cannot just use raw
>>>> accessors) and fix commit log to explain everything.
>>> Is there actually any difference in generated code with this patch applied
>>> and without? I would hope that iowrite..() is inlined and cancels out the
>>> cpu_to_le..() calls that are also inlined?
>> iowrite32 is a non-inline function so conversions take place so are the
>> others. And sorry but I fail to see why this matters. We are not trying to
>> accelerate things, we are removing redundant operations which confuse
>> people who read the code.
>
> The confusion depends on where you're coming from. If you happen to know
> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

It was like this (and this is just confusing):

iowrite32(le32_to_cpu(val), io + off);

What would make sense (according to you and I would understand this) is this:

iowrite32(cpu_to_le32(val), io + off);

Or I missed your point, did I?

> I don't have a strong feeling either way though and will let Alex decide on
> the path forward :)

It would probably help if you picked the side :)

--
Alexey

2014-06-24 13:22:54

by Alexander Graf

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> On 06/24/2014 10:52 PM, Alexander Graf wrote:
>> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>>>>
>>>>>>> Working on big endian being an accident may be a matter of perspective
>>>>>> :-)
>>>>>>
>>>>>>> The comment remains that this patch doesn't actually fix anything except
>>>>>>> the overhead on big endian systems doing redundant byte swapping and
>>>>>>> maybe the philosophy that vfio regions are little endian.
>>>>>> Yes, that works by accident because technically VFIO is a transport and
>>>>>> thus shouldn't perform any endian swapping of any sort, which remains
>>>>>> the responsibility of the end driver which is the only one to know
>>>>>> whether a given BAR location is a a register or some streaming data
>>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>>>>> even ! :-)
>>>>>>
>>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>>>>> overhead of those swaps is probably drowned in the noise of the syscall
>>>>>> overhead.
>>>>>>
>>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>>>>> can use that doesn't have an implicit swap.
>>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>>>>> variant and readl/writel style accessors still don't have them either
>>>>>> for all archs.
>>>>>>
>>>>>> There is __raw_readl/writel but here the semantics are much more than
>>>>>> just "don't swap", they also don't have memory barriers (which means
>>>>>> they are essentially useless to most drivers unless those are platform
>>>>>> specific drivers which know exactly what they are doing, or in the rare
>>>>>> cases such as accessing a framebuffer which we know never have side
>>>>>> effects).
>>>>>>
>>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>>>>> Next thing we know some common code
>>>>>>> will legitimately use that name.
>>>>>> I might make sense to those definitions into a common header. There have
>>>>>> been a handful of cases in the past that wanted that sort of "native
>>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>>>>> remember).
>>>>>>
>>>>>>> If we do need to define an alias
>>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>>>> Ping?
>>>>>
>>>>> We need to make a decision whether to move those xxx_native() helpers
>>>>> somewhere (where?) or leave the patch as is (as we figured out that
>>>>> iowriteXX functions implement barriers and we cannot just use raw
>>>>> accessors) and fix commit log to explain everything.
>>>> Is there actually any difference in generated code with this patch applied
>>>> and without? I would hope that iowrite..() is inlined and cancels out the
>>>> cpu_to_le..() calls that are also inlined?
>>> iowrite32 is a non-inline function so conversions take place so are the
>>> others. And sorry but I fail to see why this matters. We are not trying to
>>> accelerate things, we are removing redundant operations which confuse
>>> people who read the code.
>> The confusion depends on where you're coming from. If you happen to know
>> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
> It was like this (and this is just confusing):
>
> iowrite32(le32_to_cpu(val), io + off);
>
> What would make sense (according to you and I would understand this) is this:
>
> iowrite32(cpu_to_le32(val), io + off);
>
>
> Or I missed your point, did I?

No, you didn't miss it. I think for people who know how iowrite32()
works the above is obvious. I find the fact that iowrite32() writes in
LE always pretty scary though ;).

So IMHO we should either create new, generic iowrite helpers that don't
do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.

Alex

2014-06-24 14:21:41

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> > On 06/24/2014 10:52 PM, Alexander Graf wrote:
> >> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> >>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
> >>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> >>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> >>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
> >>>>>>
> >>>>>>> Working on big endian being an accident may be a matter of perspective
> >>>>>> :-)
> >>>>>>
> >>>>>>> The comment remains that this patch doesn't actually fix anything except
> >>>>>>> the overhead on big endian systems doing redundant byte swapping and
> >>>>>>> maybe the philosophy that vfio regions are little endian.
> >>>>>> Yes, that works by accident because technically VFIO is a transport and
> >>>>>> thus shouldn't perform any endian swapping of any sort, which remains
> >>>>>> the responsibility of the end driver which is the only one to know
> >>>>>> whether a given BAR location is a a register or some streaming data
> >>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
> >>>>>> even ! :-)
> >>>>>>
> >>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
> >>>>>> overhead of those swaps is probably drowned in the noise of the syscall
> >>>>>> overhead.
> >>>>>>
> >>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
> >>>>>>> can use that doesn't have an implicit swap.
> >>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
> >>>>>> variant and readl/writel style accessors still don't have them either
> >>>>>> for all archs.
> >>>>>>
> >>>>>> There is __raw_readl/writel but here the semantics are much more than
> >>>>>> just "don't swap", they also don't have memory barriers (which means
> >>>>>> they are essentially useless to most drivers unless those are platform
> >>>>>> specific drivers which know exactly what they are doing, or in the rare
> >>>>>> cases such as accessing a framebuffer which we know never have side
> >>>>>> effects).
> >>>>>>
> >>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
> >>>>>>> Next thing we know some common code
> >>>>>>> will legitimately use that name.
> >>>>>> I might make sense to those definitions into a common header. There have
> >>>>>> been a handful of cases in the past that wanted that sort of "native
> >>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> >>>>>> remember).
> >>>>>>
> >>>>>>> If we do need to define an alias
> >>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
> >>>>> Ping?
> >>>>>
> >>>>> We need to make a decision whether to move those xxx_native() helpers
> >>>>> somewhere (where?) or leave the patch as is (as we figured out that
> >>>>> iowriteXX functions implement barriers and we cannot just use raw
> >>>>> accessors) and fix commit log to explain everything.
> >>>> Is there actually any difference in generated code with this patch applied
> >>>> and without? I would hope that iowrite..() is inlined and cancels out the
> >>>> cpu_to_le..() calls that are also inlined?
> >>> iowrite32 is a non-inline function so conversions take place so are the
> >>> others. And sorry but I fail to see why this matters. We are not trying to
> >>> accelerate things, we are removing redundant operations which confuse
> >>> people who read the code.
> >> The confusion depends on where you're coming from. If you happen to know
> >> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
> > It was like this (and this is just confusing):
> >
> > iowrite32(le32_to_cpu(val), io + off);
> >
> > What would make sense (according to you and I would understand this) is this:
> >
> > iowrite32(cpu_to_le32(val), io + off);
> >
> >
> > Or I missed your point, did I?
>
> No, you didn't miss it. I think for people who know how iowrite32()
> works the above is obvious. I find the fact that iowrite32() writes in
> LE always pretty scary though ;).
>
> So IMHO we should either create new, generic iowrite helpers that don't
> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.

I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
and keeps the byte order consistent regardless of the platform, while
iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
remember that the byte swaps are a nop on the given platforms. As Ben
noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
I'd prefer an attempt be made to make it exist before adding
vfio-specific macros. vfio is arguably doing the right thing here given
the functions available. Thanks,

Alex

2014-06-24 14:33:36

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/25/2014 12:21 AM, Alex Williamson wrote:
> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
>> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
>>> On 06/24/2014 10:52 PM, Alexander Graf wrote:
>>>> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>>>>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>>>>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>>>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>>>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>>>>>>
>>>>>>>>> Working on big endian being an accident may be a matter of perspective
>>>>>>>> :-)
>>>>>>>>
>>>>>>>>> The comment remains that this patch doesn't actually fix anything except
>>>>>>>>> the overhead on big endian systems doing redundant byte swapping and
>>>>>>>>> maybe the philosophy that vfio regions are little endian.
>>>>>>>> Yes, that works by accident because technically VFIO is a transport and
>>>>>>>> thus shouldn't perform any endian swapping of any sort, which remains
>>>>>>>> the responsibility of the end driver which is the only one to know
>>>>>>>> whether a given BAR location is a a register or some streaming data
>>>>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>>>>>>> even ! :-)
>>>>>>>>
>>>>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>>>>>>> overhead of those swaps is probably drowned in the noise of the syscall
>>>>>>>> overhead.
>>>>>>>>
>>>>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>>>>>>> can use that doesn't have an implicit swap.
>>>>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>>>>>>> variant and readl/writel style accessors still don't have them either
>>>>>>>> for all archs.
>>>>>>>>
>>>>>>>> There is __raw_readl/writel but here the semantics are much more than
>>>>>>>> just "don't swap", they also don't have memory barriers (which means
>>>>>>>> they are essentially useless to most drivers unless those are platform
>>>>>>>> specific drivers which know exactly what they are doing, or in the rare
>>>>>>>> cases such as accessing a framebuffer which we know never have side
>>>>>>>> effects).
>>>>>>>>
>>>>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>>>>>>> Next thing we know some common code
>>>>>>>>> will legitimately use that name.
>>>>>>>> I might make sense to those definitions into a common header. There have
>>>>>>>> been a handful of cases in the past that wanted that sort of "native
>>>>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>>>>>>> remember).
>>>>>>>>
>>>>>>>>> If we do need to define an alias
>>>>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>>>>>> Ping?
>>>>>>>
>>>>>>> We need to make a decision whether to move those xxx_native() helpers
>>>>>>> somewhere (where?) or leave the patch as is (as we figured out that
>>>>>>> iowriteXX functions implement barriers and we cannot just use raw
>>>>>>> accessors) and fix commit log to explain everything.
>>>>>> Is there actually any difference in generated code with this patch applied
>>>>>> and without? I would hope that iowrite..() is inlined and cancels out the
>>>>>> cpu_to_le..() calls that are also inlined?
>>>>> iowrite32 is a non-inline function so conversions take place so are the
>>>>> others. And sorry but I fail to see why this matters. We are not trying to
>>>>> accelerate things, we are removing redundant operations which confuse
>>>>> people who read the code.
>>>> The confusion depends on where you're coming from. If you happen to know
>>>> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
>>> It was like this (and this is just confusing):
>>>
>>> iowrite32(le32_to_cpu(val), io + off);
>>>
>>> What would make sense (according to you and I would understand this) is this:
>>>
>>> iowrite32(cpu_to_le32(val), io + off);
>>>
>>>
>>> Or I missed your point, did I?
>>
>> No, you didn't miss it. I think for people who know how iowrite32()
>> works the above is obvious. I find the fact that iowrite32() writes in
>> LE always pretty scary though ;).
>>
>> So IMHO we should either create new, generic iowrite helpers that don't
>> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
>
> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense

I do not understand why @val is considered LE here and need to be converted
to CPU. Really. I truly believe it should be cpu_to_le32().

> and keeps the byte order consistent regardless of the platform, while
> iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
> remember that the byte swaps are a nop on the given platforms. As Ben
> noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
> I'd prefer an attempt be made to make it exist before adding
> vfio-specific macros. vfio is arguably doing the right thing here given
> the functions available. Thanks,

--
Alexey

2014-06-24 14:40:46

by David Laight

[permalink] [raw]

Subject: RE: [PATCH] vfio: Fix endianness handling for emulated BARs

From: Alexey Kardashevskiy
...
> >> So IMHO we should either create new, generic iowrite helpers that don't
> >> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> >
> > I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>
>
> I do not understand why @val is considered LE here and need to be converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

Think about it carefully...

Apparently iowrite32() is writing a 'cpu' value out 'le'.
So if you have a 'le' value you need to convert it to 'cpu' first.

I really won't ask how 'be' ppc managed to get 'le' on-chip peripherals.

David

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2014-06-24 14:43:21

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
> On 06/25/2014 12:21 AM, Alex Williamson wrote:
> > On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
> >> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> >>> On 06/24/2014 10:52 PM, Alexander Graf wrote:
> >>>> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> >>>>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
> >>>>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> >>>>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> >>>>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
> >>>>>>>>
> >>>>>>>>> Working on big endian being an accident may be a matter of perspective
> >>>>>>>> :-)
> >>>>>>>>
> >>>>>>>>> The comment remains that this patch doesn't actually fix anything except
> >>>>>>>>> the overhead on big endian systems doing redundant byte swapping and
> >>>>>>>>> maybe the philosophy that vfio regions are little endian.
> >>>>>>>> Yes, that works by accident because technically VFIO is a transport and
> >>>>>>>> thus shouldn't perform any endian swapping of any sort, which remains
> >>>>>>>> the responsibility of the end driver which is the only one to know
> >>>>>>>> whether a given BAR location is a a register or some streaming data
> >>>>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
> >>>>>>>> even ! :-)
> >>>>>>>>
> >>>>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
> >>>>>>>> overhead of those swaps is probably drowned in the noise of the syscall
> >>>>>>>> overhead.
> >>>>>>>>
> >>>>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
> >>>>>>>>> can use that doesn't have an implicit swap.
> >>>>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
> >>>>>>>> variant and readl/writel style accessors still don't have them either
> >>>>>>>> for all archs.
> >>>>>>>>
> >>>>>>>> There is __raw_readl/writel but here the semantics are much more than
> >>>>>>>> just "don't swap", they also don't have memory barriers (which means
> >>>>>>>> they are essentially useless to most drivers unless those are platform
> >>>>>>>> specific drivers which know exactly what they are doing, or in the rare
> >>>>>>>> cases such as accessing a framebuffer which we know never have side
> >>>>>>>> effects).
> >>>>>>>>
> >>>>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
> >>>>>>>>> Next thing we know some common code
> >>>>>>>>> will legitimately use that name.
> >>>>>>>> I might make sense to those definitions into a common header. There have
> >>>>>>>> been a handful of cases in the past that wanted that sort of "native
> >>>>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> >>>>>>>> remember).
> >>>>>>>>
> >>>>>>>>> If we do need to define an alias
> >>>>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
> >>>>>>> Ping?
> >>>>>>>
> >>>>>>> We need to make a decision whether to move those xxx_native() helpers
> >>>>>>> somewhere (where?) or leave the patch as is (as we figured out that
> >>>>>>> iowriteXX functions implement barriers and we cannot just use raw
> >>>>>>> accessors) and fix commit log to explain everything.
> >>>>>> Is there actually any difference in generated code with this patch applied
> >>>>>> and without? I would hope that iowrite..() is inlined and cancels out the
> >>>>>> cpu_to_le..() calls that are also inlined?
> >>>>> iowrite32 is a non-inline function so conversions take place so are the
> >>>>> others. And sorry but I fail to see why this matters. We are not trying to
> >>>>> accelerate things, we are removing redundant operations which confuse
> >>>>> people who read the code.
> >>>> The confusion depends on where you're coming from. If you happen to know
> >>>> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
> >>> It was like this (and this is just confusing):
> >>>
> >>> iowrite32(le32_to_cpu(val), io + off);
> >>>
> >>> What would make sense (according to you and I would understand this) is this:
> >>>
> >>> iowrite32(cpu_to_le32(val), io + off);
> >>>
> >>>
> >>> Or I missed your point, did I?
> >>
> >> No, you didn't miss it. I think for people who know how iowrite32()
> >> works the above is obvious. I find the fact that iowrite32() writes in
> >> LE always pretty scary though ;).
> >>
> >> So IMHO we should either create new, generic iowrite helpers that don't
> >> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> >
> > I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>
>
> I do not understand why @val is considered LE here and need to be converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

Because iowrite32 is defined to take a cpu byte order value and write it
as little endian.

> > and keeps the byte order consistent regardless of the platform, while
> > iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
> > remember that the byte swaps are a nop on the given platforms. As Ben
> > noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
> > I'd prefer an attempt be made to make it exist before adding
> > vfio-specific macros. vfio is arguably doing the right thing here given
> > the functions available. Thanks,
>
>

2014-06-24 16:27:09

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/25/2014 12:43 AM, Alex Williamson wrote:
> On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>> On 06/25/2014 12:21 AM, Alex Williamson wrote:
>>> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
>>>> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
>>>>> On 06/24/2014 10:52 PM, Alexander Graf wrote:
>>>>>> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>>>>>>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>>>>>>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>>>>>>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>>>>>>>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>>>>>>>>
>>>>>>>>>>> Working on big endian being an accident may be a matter of perspective
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>>> The comment remains that this patch doesn't actually fix anything except
>>>>>>>>>>> the overhead on big endian systems doing redundant byte swapping and
>>>>>>>>>>> maybe the philosophy that vfio regions are little endian.
>>>>>>>>>> Yes, that works by accident because technically VFIO is a transport and
>>>>>>>>>> thus shouldn't perform any endian swapping of any sort, which remains
>>>>>>>>>> the responsibility of the end driver which is the only one to know
>>>>>>>>>> whether a given BAR location is a a register or some streaming data
>>>>>>>>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>>>>>>>>> even ! :-)
>>>>>>>>>>
>>>>>>>>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>>>>>>>>> overhead of those swaps is probably drowned in the noise of the syscall
>>>>>>>>>> overhead.
>>>>>>>>>>
>>>>>>>>>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>>>>>>>>>> can use that doesn't have an implicit swap.
>>>>>>>>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>>>>>>>>> variant and readl/writel style accessors still don't have them either
>>>>>>>>>> for all archs.
>>>>>>>>>>
>>>>>>>>>> There is __raw_readl/writel but here the semantics are much more than
>>>>>>>>>> just "don't swap", they also don't have memory barriers (which means
>>>>>>>>>> they are essentially useless to most drivers unless those are platform
>>>>>>>>>> specific drivers which know exactly what they are doing, or in the rare
>>>>>>>>>> cases such as accessing a framebuffer which we know never have side
>>>>>>>>>> effects).
>>>>>>>>>>
>>>>>>>>>>> Calling it iowrite*_native is also an abuse of the namespace.
>>>>>>>>>>> Next thing we know some common code
>>>>>>>>>>> will legitimately use that name.
>>>>>>>>>> I might make sense to those definitions into a common header. There have
>>>>>>>>>> been a handful of cases in the past that wanted that sort of "native
>>>>>>>>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>>>>>>>>> remember).
>>>>>>>>>>
>>>>>>>>>>> If we do need to define an alias
>>>>>>>>>>> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>>>>>>>> Ping?
>>>>>>>>>
>>>>>>>>> We need to make a decision whether to move those xxx_native() helpers
>>>>>>>>> somewhere (where?) or leave the patch as is (as we figured out that
>>>>>>>>> iowriteXX functions implement barriers and we cannot just use raw
>>>>>>>>> accessors) and fix commit log to explain everything.
>>>>>>>> Is there actually any difference in generated code with this patch applied
>>>>>>>> and without? I would hope that iowrite..() is inlined and cancels out the
>>>>>>>> cpu_to_le..() calls that are also inlined?
>>>>>>> iowrite32 is a non-inline function so conversions take place so are the
>>>>>>> others. And sorry but I fail to see why this matters. We are not trying to
>>>>>>> accelerate things, we are removing redundant operations which confuse
>>>>>>> people who read the code.
>>>>>> The confusion depends on where you're coming from. If you happen to know
>>>>>> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
>>>>> It was like this (and this is just confusing):
>>>>>
>>>>> iowrite32(le32_to_cpu(val), io + off);
>>>>>
>>>>> What would make sense (according to you and I would understand this) is this:
>>>>>
>>>>> iowrite32(cpu_to_le32(val), io + off);
>>>>>
>>>>>
>>>>> Or I missed your point, did I?
>>>>
>>>> No, you didn't miss it. I think for people who know how iowrite32()
>>>> works the above is obvious. I find the fact that iowrite32() writes in
>>>> LE always pretty scary though ;).
>>>>
>>>> So IMHO we should either create new, generic iowrite helpers that don't
>>>> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
>>>
>>> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>>
>>
>> I do not understand why @val is considered LE here and need to be converted
>> to CPU. Really. I truly believe it should be cpu_to_le32().
>
> Because iowrite32 is defined to take a cpu byte order value and write it
> as little endian.

Ok, then neither le32_to_cpu() nor cpu_to_le32() should be there at all, if
we are talking about not scratching anyone's head :)

>>> and keeps the byte order consistent regardless of the platform, while
>>> iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
>>> remember that the byte swaps are a nop on the given platforms. As Ben
>>> noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
>>> I'd prefer an attempt be made to make it exist before adding
>>> vfio-specific macros. vfio is arguably doing the right thing here given
>>> the functions available. Thanks,

I do not mind to make that atempt but what exactly would make sense here?
Try moving macros to include/asm-generic/io.h? Something else? Thanks.

--
Alexey

2014-06-24 21:46:45

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Tue, 2014-06-24 at 12:41 +0200, Alexander Graf wrote:
> Is there actually any difference in generated code with this patch
> applied and without? I would hope that iowrite..() is inlined and
> cancels out the cpu_to_le..() calls that are also inlined?

No, the former uses byteswapping asm, the compiler can't "cancel" it
out, but the overhead of the additional byteswap might not be
measurable.

Cheers,
Ben.

2014-06-24 21:55:05

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>
> I do not understand why @val is considered LE here and need to be
> converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

No. Both are slightly wrong semantically but le32_to_cpu() is less
wrong :-)

iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
value. So if anything, you need something that converts to a "cpu" value
before you call iowrite32.

Now it's still slightly wrong because the "input" to le32_to_cpu() is
supposed to be an "LE" value but of course here it's not, it's a "cpu"
value too :-)

But yes, I agree with aw here, either do nothing or stick a set of
iowriteXX_native or something equivalent in the generic iomap header,
define them in term of iowrite32be/iowrite32 based on the compile time
endian of the arch.

Hitting asm-generic/iomap.h I think will cover all archs except ARM.
For ARM, just hack arch/arm/asm/io.h

Cheers,
Ben.

2014-06-25 02:43:25

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs

On 06/25/2014 07:54 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>>
>> I do not understand why @val is considered LE here and need to be
>> converted
>> to CPU. Really. I truly believe it should be cpu_to_le32().
>
> No. Both are slightly wrong semantically but le32_to_cpu() is less
> wrong :-)
>
> iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
> value. So if anything, you need something that converts to a "cpu" value
> before you call iowrite32.
>
> Now it's still slightly wrong because the "input" to le32_to_cpu() is
> supposed to be an "LE" value but of course here it's not, it's a "cpu"
> value too :-)
>
> But yes, I agree with aw here, either do nothing or stick a set of
> iowriteXX_native or something equivalent in the generic iomap header,
> define them in term of iowrite32be/iowrite32 based on the compile time
> endian of the arch.

Ok. I'll do nothing.

> Hitting asm-generic/iomap.h I think will cover all archs except ARM.
> For ARM, just hack arch/arm/asm/io.h

--
Alexey