Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757426AbaFSGRK (ORCPT ); Thu, 19 Jun 2014 02:17:10 -0400 Received: from mail-pb0-f50.google.com ([209.85.160.50]:56059 "EHLO mail-pb0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752598AbaFSGRI (ORCPT ); Thu, 19 Jun 2014 02:17:08 -0400 Message-ID: <53A2805E.3000401@ozlabs.ru> Date: Thu, 19 Jun 2014 16:17:02 +1000 From: Alexey Kardashevskiy User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: "Bharat.Bhushan@freescale.com" , Alex Williamson CC: "kvm@vger.kernel.org" , Nikunj A Dadhania , "linux-kernel@vger.kernel.org" , Alexander Graf , "linuxppc-dev@lists.ozlabs.org" Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs References: <1403091391-31780-1-git-send-email-aik@ozlabs.ru> <1403116512.3707.175.camel@ul30vt.home> <53A233E9.6030006@ozlabs.ru> <53A241F6.9010307@ozlabs.ru> <53A25D74.5000804@ozlabs.ru> <0df01f652acd446fa6db39f7049d6e72@BLUPR03MB566.namprd03.prod.outlook.com> In-Reply-To: <0df01f652acd446fa6db39f7049d6e72@BLUPR03MB566.namprd03.prod.outlook.com> Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/19/2014 03:30 PM, Bharat.Bhushan@freescale.com wrote: > > >> -----Original Message----- >> From: Linuxppc-dev [mailto:linuxppc-dev- >> bounces+bharat.bhushan=freescale.com@lists.ozlabs.org] On Behalf Of Alexey >> Kardashevskiy >> Sent: Thursday, June 19, 2014 9:18 AM >> To: Alex Williamson >> Cc: kvm@vger.kernel.org; Nikunj A Dadhania; linux-kernel@vger.kernel.org; >> Alexander Graf; linuxppc-dev@lists.ozlabs.org >> Subject: Re: [PATCH] vfio: Fix endianness handling for emulated BARs >> >> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote: >>> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote: >>>> On 06/19/2014 04:35 AM, Alex Williamson wrote: >>>>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote: >>>>>> VFIO exposes BARs to user space as a byte stream so userspace can >>>>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO >>>>>> should not do byte swapping and simply return values as it gets >>>>>> them from PCI device. >>>>>> >>>>>> Instead, the existing code assumes that byte stream in read/write >>>>>> is little-endian and it fixes endianness for values which it passes >>>>>> to ioreadXX/iowriteXX helpers. This works for little-endian as PCI >>>>>> is little endian and le32_to_cpu/... are stubs. >>>>> >>>>> vfio read32: >>>>> >>>>> val = cpu_to_le32(ioread32(io + off)); >>>>> >>>>> Where the typical x86 case, ioread32 is: >>>>> >>>>> #define ioread32(addr) readl(addr) >>>>> >>>>> and readl is: >>>>> >>>>> __le32_to_cpu(__raw_readl(addr)); >>>>> >>>>> So we do canceling byte swaps, which are both nops on x86, and end >>>>> up returning device endian, which we assume is little endian. >>>>> >>>>> vfio write32 is similar: >>>>> >>>>> iowrite32(le32_to_cpu(val), io + off); >>>>> >>>>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel >>>>> out, so input data is device endian, which is assumed little. >>>>> >>>>>> This also works for big endian but rather by an accident: it reads >>>>>> 4 bytes from the stream (@val is big endian), converts to CPU >>>>>> format (which should be big endian) as it was little endian (@val >>>>>> becomes actually little >>>>>> endian) and calls iowrite32() which does not do swapping on big >>>>>> endian system. >>>>> >>>>> Really? >>>>> >>>>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around >>>>> writel(), which seems to use the generic implementation, which does >>>>> include a cpu_to_le32. >>>> >>>> >>>> Ouch, wrong comment. iowrite32() does swapping. My bad. >>>> >>>> >>>>> >>>>> I also see other big endian archs like parisc doing cpu_to_le32 on >>>>> iowrite32, so I don't think this statement is true. I imagine it's >>>>> probably working for you because the swap cancel. >>>>> >>>>>> This removes byte swapping and makes use ioread32be/iowrite32be >>>>>> (and 16bit versions) on big-endian systems. The "be" helpers take >>>>>> native endian values and do swapping at the moment of writing to a >>>>>> PCI register using one of "store byte-reversed" instructions. >>>>> >>>>> So now you want iowrite32() on little endian and iowrite32be() on >>>>> big endian, the former does a cpu_to_le32 (which is a nop on little >>>>> endian) and the latter does a cpu_to_be32 (which is a nop on big endian)... >>>>> should we just be using __raw_writel() on both? >>>> >>>> >>>> We can do that too. The beauty of iowrite32be on ppc64 is that it >>>> does not swap and write separately, it is implemented via the "Store >>>> Word Byte-Reverse Indexed X-form" single instruction. >>>> >>>> And some archs (do not know which ones) may add memory barriers in >>>> their implementations of ioread/iowrite. __raw_writel is too raw :) >>>> >>>>> There doesn't actually >>>>> seem to be any change in behavior here, it just eliminates >>>>> back-to-back byte swaps, which are a nop on x86, but not power, right? >>>> >>>> Exactly. No dependency for QEMU. >>> >>> How about that: >>> === >>> >>> VFIO exposes BARs to user space as a byte stream so userspace can read >>> it using pread()/pwrite(). Since this is a byte stream, VFIO should >>> not do byte swapping and simply return values as it gets them from PCI >>> device. >>> >>> Instead, the existing code assumes that byte stream in read/write is >>> little-endian and it fixes endianness for values which it passes to >>> ioreadXX/iowriteXX helpers in native format. The IO helpers do >>> swapping again. Since both byte swaps are nops on little-endian host, this >> works. >>> >>> This also works for big endian but rather by an accident: it reads 4 >>> bytes from the stream (@val is big endian), converts to CPU format >>> (which should be big endian) as it was little endian (and @val becomes >>> actually little >>> endian) and calls iowrite32() which does swapping on big endian system >>> again. So byte swap gets cancelled, __raw_writel() receives a native >>> value and then *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = >>> v; just does the right thing. >> >> I am wrong here, sorry. This is what happens when you watch soccer between 2am >> and 4am :) >> >> >>> >>> This removes byte swaps and makes use of ioread32be/iowrite32be (and >>> 16bit versions) which do explicit byte swapping at the moment of write >>> to a PCI register. PPC64 uses a special "Store Word Byte-Reverse >>> Indexed X-form" instruction which does swap and store. >> >> No swapping is done here if we use ioread32be as it calls in_be32 and that >> animal does "lwz" which is simple load from memory. >> >> So @val (16/32 bit variable on stack) will have different values on LE and BE >> but since we do not handle it the host and just memcpy it to the buffer, nothing >> breaks here. >> >> >> So it should be like this: >> === >> VFIO exposes BARs to user space as a byte stream so userspace can read it using >> pread()/pwrite(). Since this is a byte stream, VFIO should not do byte swapping >> and simply return values as it gets them from PCI device and copy_to_user will >> save bytes in the correct same true for writes. > > " copy_to_user will save bytes in the correct" ---? --- "same true for writes". Oops. "correct order" is it. > > Thanks > -Bharat > >> >> Instead, the existing code assumes that byte stream in read/write is little- >> endian and it fixes endianness for values which it passes to ioreadXX/iowriteXX >> helpers in native format. The IO helpers do swapping again. Since both byte >> swaps are nops on little-endian host, this works. > >> >> This also works for big endian but rather by an accident: it reads 4 bytes from >> the stream (@val is big endian), converts to CPU format (which should be big >> endian) as it was little endian (and @val becomes actually little >> endian) and calls iowrite32() which does swapping on big endian system again. So >> byte swap in the host gets cancelled and __raw_writel() writes the value which >> was swapped originally by the guest. >> >> This removes byte swaps and makes use of ioread32be/iowrite32be (and 16bit >> versions) which do not do byte swap on BE hosts. >> For LE hosts, ioread32/iowrite32 are still used. >> >> === >> >> >>> === >>> >>> any better? >>> >>> >>> >>> >>>>>> Suggested-by: Benjamin Herrenschmidt >>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>> --- >>>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++---- >>>>>> 1 file changed, 16 insertions(+), 4 deletions(-) >>>>>> >>>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c >>>>>> b/drivers/vfio/pci/vfio_pci_rdwr.c >>>>>> index 210db24..f363b5a 100644 >>>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c >>>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c >>>>>> @@ -21,6 +21,18 @@ >>>>>> >>>>>> #include "vfio_pci_private.h" >>>>>> >>>>>> +#ifdef __BIG_ENDIAN__ >>>>>> +#define ioread16_native ioread16be >>>>>> +#define ioread32_native ioread32be >>>>>> +#define iowrite16_native iowrite16be >>>>>> +#define iowrite32_native iowrite32be >>>>>> +#else >>>>>> +#define ioread16_native ioread16 >>>>>> +#define ioread32_native ioread32 >>>>>> +#define iowrite16_native iowrite16 >>>>>> +#define iowrite32_native iowrite32 >>>>>> +#endif >>>>>> + >>>>>> /* >>>>>> * Read or write from an __iomem region (MMIO or I/O port) with an >> excluded >>>>>> * range which is inaccessible. The excluded range drops writes >>>>>> and fills @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char >> __user *buf, >>>>>> if (copy_from_user(&val, buf, 4)) >>>>>> return -EFAULT; >>>>>> >>>>>> - iowrite32(le32_to_cpu(val), io + off); >>>>>> + iowrite32_native(val, io + off); >>>>>> } else { >>>>>> - val = cpu_to_le32(ioread32(io + off)); >>>>>> + val = ioread32_native(io + off); >>>>>> >>>>>> if (copy_to_user(buf, &val, 4)) >>>>>> return -EFAULT; >>>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user >> *buf, >>>>>> if (copy_from_user(&val, buf, 2)) >>>>>> return -EFAULT; >>>>>> >>>>>> - iowrite16(le16_to_cpu(val), io + off); >>>>>> + iowrite16_native(val, io + off); >>>>>> } else { >>>>>> - val = cpu_to_le16(ioread16(io + off)); >>>>>> + val = ioread16_native(io + off); >>>>>> >>>>>> if (copy_to_user(buf, &val, 2)) >>>>>> return -EFAULT; >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> >> -- >> Alexey >> _______________________________________________ >> Linuxppc-dev mailing list >> Linuxppc-dev@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/linuxppc-dev -- Alexey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/