Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757447AbcCRPEf (ORCPT ); Fri, 18 Mar 2016 11:04:35 -0400 Received: from e28smtp03.in.ibm.com ([125.16.236.3]:35625 "EHLO e28smtp03.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755819AbcCRPEb (ORCPT ); Fri, 18 Mar 2016 11:04:31 -0400 X-IBM-Helo: d28relay04.in.ibm.com X-IBM-MailFrom: xyjxie@linux.vnet.ibm.com X-IBM-RcptTo: kvm@vger.kernel.org;linux-kernel@vger.kernel.org;linux-pci@vger.kernel.org;linux-doc@vger.kernel.org Subject: Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices To: Alex Williamson References: <1457336918-3893-1-git-send-email-xyjxie@linux.vnet.ibm.com> <1457336918-3893-5-git-send-email-xyjxie@linux.vnet.ibm.com> <20160316103022.246edf22@t450s.home> <56EA94E2.1010502@linux.vnet.ibm.com> <20160317064048.56b25b9f@ul30vt.home> Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-doc@vger.kernel.org, bhelgaas@google.com, corbet@lwn.net, aik@ozlabs.ru, benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au, warrier@linux.vnet.ibm.com, zhong@linux.vnet.ibm.com, nikunj@linux.vnet.ibm.com From: Yongji Xie Message-ID: <56EC18EA.5090002@linux.vnet.ibm.com> Date: Fri, 18 Mar 2016 23:04:10 +0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <20160317064048.56b25b9f@ul30vt.home> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-TM-AS-MML: disable x-cbid: 16031815-0009-0000-0000-00000B5860D1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12873 Lines: 314 On 2016/3/17 20:40, Alex Williamson wrote: > On Thu, 17 Mar 2016 19:28:34 +0800 > Yongji Xie wrote: > >> On 2016/3/17 0:30, Alex Williamson wrote: >>> On Mon, 7 Mar 2016 15:48:35 +0800 >>> Yongji Xie wrote: >>> >>>> When vfio passthrough a PCI device of which MMIO BARs >>>> are smaller than PAGE_SIZE, guest will not handle the >>>> mmio accesses to the BARs which leads to mmio emulations >>>> in host. >>>> >>>> This is because vfio will not allow to passthrough one >>>> BAR's mmio page which may be shared with other BARs. >>>> >>>> To solve this performance issue, this patch modifies >>>> resource_alignment to support syntax where multiple >>>> devices get the same alignment. So we can use something >>>> like "pci=resource_alignment=*:*:*.*:noresize" to >>>> enforce the alignment of all MMIO BARs to be at least >>>> PAGE_SIZE so that one BAR's mmio page would not be >>>> shared with other BARs. >>>> >>>> Signed-off-by: Yongji Xie >>>> --- >>>> Documentation/kernel-parameters.txt | 2 + >>>> drivers/pci/pci.c | 90 ++++++++++++++++++++++++++++++----- >>>> include/linux/pci.h | 4 ++ >>>> 3 files changed, 85 insertions(+), 11 deletions(-) >>>> >>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >>>> index 8028631..74b38ab 100644 >>>> --- a/Documentation/kernel-parameters.txt >>>> +++ b/Documentation/kernel-parameters.txt >>>> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. >>>> aligned memory resources. >>>> If is not specified, >>>> PAGE_SIZE is used as alignment. >>>> + , , and can be set to >>>> + "*" which means match all values. >>> I don't see anywhere that you're automatically enabling this for your >>> platform, so presumably you're expecting users to determine on their >>> own that they have a performance problem and hoping that they'll figure >>> out that they need to use this option to resolve it. The first irate >>> question you'll get back is why doesn't this happen automatically? >> Actually, I just want to make the code simple. Maybe it is >> not a good idea to let users enable this option manually. >> I will try to fix it like that: > That's not entirely what I meant, I think having a way for a user to > enable it is a good thing, but maybe there need to be cases where it's > enabled automatically. With 4k pages, this is often not an issue since > the PCI spec recommends 4k or 8k alignment for resources, but that > doesn't preclude an unusual device where a user might want to enable it > anyway. At 64k page size, problems become more common, so we need to > think about either enabling it automatically or somehow making it more > apparent to the user that the option is available for this purpose. OK. I see. I will add a new arch-independent macro to indicate the min alignments of all PCI BARs. And we can also specify the alignments through the resource_alignment. >> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h >> index 6f8065a..6659752 100644 >> --- a/arch/powerpc/include/asm/pci.h >> +++ b/arch/powerpc/include/asm/pci.h >> @@ -30,6 +30,8 @@ >> #define PCIBIOS_MIN_IO 0x1000 >> #define PCIBIOS_MIN_MEM 0x10000000 >> >> +#define PCIBIOS_MIN_ALIGNMENT PAGE_SIZE >> + >> struct pci_dev; >> >> /* Values for the `which' argument to sys_pciconfig_iobase syscall. */ >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index dadd28a..9f644e4 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -4593,6 +4593,8 @@ EXPORT_SYMBOL_GPL(pci_ignore_hotplug); >> static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0}; >> static DEFINE_SPINLOCK(resource_alignment_lock); >> >> +#define DISABLE_ARCH_ALIGNMENT -1 >> +#define DEFAULT_ALIGNMENT -2 >> /** >> * pci_specified_resource_alignment - get resource alignment specified >> by user. >> * @dev: the PCI device to get >> @@ -4609,6 +4611,9 @@ static resource_size_t >> pci_specified_resource_alignment(struct pci_dev *dev, >> char *p; >> bool invalid = false; >> >> +#ifdef PCIBIOS_MIN_ALIGNMENT >> + align = PCIBIOS_MIN_ALIGNMENT; >> +#endif >> spin_lock(&resource_alignment_lock); >> p = resource_alignment_param; >> while (*p) { >> @@ -4617,7 +4622,7 @@ static resource_size_t >> pci_specified_resource_alignment(struct pci_dev *dev, >> p[count] == '@') { >> p += count + 1; >> } else { >> - align_order = -1; >> + align_order = DEFAULT_ALIGNMENT; >> } >> if (p[0] == '*' && p[1] == ':') { >> seg = -1; >> @@ -4673,8 +4678,10 @@ static resource_size_t >> pci_specified_resource_alignment(struct pci_dev *dev, >> (bus == dev->bus->number || bus == -1) && >> (slot == PCI_SLOT(dev->devfn) || slot == -1) && >> (func == PCI_FUNC(dev->devfn) || func == -1)) { >> - if (align_order == -1) >> + if (align_order == DEFAULT_ALIGNMENT) >> align = PAGE_SIZE; >> + else if (align_order == DISABLE_ARCH_ALIGNMENT) >> + align = 0; >> else >> align = 1 << align_order; >> if (!pci_resources_page_aligned && >>>> PCI-PCI bridge can be specified, if resource >>>> windows need to be expanded. >>>> noresize: Don't change the resources' sizes when >>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>> index 760cce5..44ab59f 100644 >>>> --- a/drivers/pci/pci.c >>>> +++ b/drivers/pci/pci.c >>>> @@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255; >>>> /* If set, the PCIe ARI capability will not be used. */ >>>> static bool pcie_ari_disabled; >>>> >>>> +bool pci_resources_page_aligned; >>>> + >>>> /** >>>> * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children >>>> * @bus: pointer to PCI bus structure to search >>>> @@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, >>>> int seg, bus, slot, func, align_order, count; >>>> resource_size_t align = 0; >>>> char *p; >>>> + bool invalid = false; >>>> >>>> spin_lock(&resource_alignment_lock); >>>> p = resource_alignment_param; >>>> @@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, >>>> } else { >>>> align_order = -1; >>>> } >>>> - if (sscanf(p, "%x:%x:%x.%x%n", >>>> - &seg, &bus, &slot, &func, &count) != 4) { >>>> + if (p[0] == '*' && p[1] == ':') { >>>> + seg = -1; >>>> + count = 1; >>>> + } else if (sscanf(p, "%x%n", &seg, &count) != 1 || >>>> + p[count] != ':') { >>>> + invalid = true; >>>> + break; >>>> + } >>>> + p += count + 1; >>>> + if (*p == '*') { >>>> + bus = -1; >>>> + count = 1; >>>> + } else if (sscanf(p, "%x%n", &bus, &count) != 1) { >>>> + invalid = true; >>>> + break; >>>> + } >>>> + p += count; >>>> + if (*p == '.') { >>>> + slot = bus; >>>> + bus = seg; >>>> seg = 0; >>>> - if (sscanf(p, "%x:%x.%x%n", >>>> - &bus, &slot, &func, &count) != 3) { >>>> - /* Invalid format */ >>>> - printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n", >>>> - p); >>>> + p++; >>>> + } else if (*p == ':') { >>>> + p++; >>>> + if (p[0] == '*' && p[1] == '.') { >>>> + slot = -1; >>>> + count = 1; >>>> + } else if (sscanf(p, "%x%n", &slot, &count) != 1 || >>>> + p[count] != '.') { >>>> + invalid = true; >>>> break; >>>> } >>>> + p += count + 1; >>>> + } else { >>>> + invalid = true; >>>> + break; >>>> + } >>>> + if (*p == '*') { >>>> + func = -1; >>>> + count = 1; >>>> + } else if (sscanf(p, "%x%n", &func, &count) != 1) { >>>> + invalid = true; >>>> + break; >>>> } >>>> p += count; >>>> if (!strncmp(p, ":noresize", 9)) { >>>> @@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, >>>> p += 9; >>>> } else >>>> *resize = true; >>>> - if (seg == pci_domain_nr(dev->bus) && >>>> - bus == dev->bus->number && >>>> - slot == PCI_SLOT(dev->devfn) && >>>> - func == PCI_FUNC(dev->devfn)) { >>>> + if ((seg == pci_domain_nr(dev->bus) || seg == -1) && >>>> + (bus == dev->bus->number || bus == -1) && >>>> + (slot == PCI_SLOT(dev->devfn) || slot == -1) && >>>> + (func == PCI_FUNC(dev->devfn) || func == -1)) { >>>> if (align_order == -1) >>>> align = PAGE_SIZE; >>>> else >>>> align = 1 << align_order; >>>> + if (!pci_resources_page_aligned && >>>> + (align >= PAGE_SIZE && >>>> + seg == -1 && bus == -1 && >>>> + slot == -1 && func == -1)) >>>> + pci_resources_page_aligned = true; >>>> /* Found */ >>>> break; >>>> } >>>> if (*p != ';' && *p != ',') { >>>> /* End of param or invalid format */ >>>> + invalid = true; >>>> break; >>>> } >>>> p++; >>>> } >>>> + if (invalid == true) { >>>> + /* Invalid format */ >>>> + printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n", >>>> + p); >>>> + } >>>> spin_unlock(&resource_alignment_lock); >>>> return align; >>>> } >>>> @@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void) >>>> } >>>> late_initcall(pci_resource_alignment_sysfs_init); >>>> >>>> +/* >>>> + * This function checks whether PCI BARs' mmio page will be shared >>>> + * with other BARs. >>>> + */ >>>> +bool pci_resources_share_page(struct pci_dev *dev, int resno) >>>> +{ >>>> + struct resource *res = dev->resource + resno; >>>> + >>>> + if (resource_size(res) >= PAGE_SIZE) >>>> + return false; >>>> + if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) && >>>> + res->flags & IORESOURCE_MEM) { >>>> + if (res->sibling) >>>> + return (res->sibling->start & ~PAGE_MASK); >>>> + else >>>> + return false; >>>> + } >>>> + return true; >>>> +} >>>> +EXPORT_SYMBOL_GPL(pci_resources_share_page); >>> This should be a separate patch, there's no mention of this part of the >>> change at all in the commitlog. Also, pci_resource_page_aligned is >>> only set if we use the magic wildcards to set alignment for all >>> devices, it's not set if we use a specific seg:bus:dev.fn. That's >>> not consistent. Can't we make use of the STARTALIGN flag for this or >>> did that get removed when resources were assigned? >> I only set pci_resource_page_aligned when using magic wildcards >> because I must make sure pci_resources_share_page() can return >> correctly when a hotplug event happens. >> >> For example, there is only one half-page resource in the system. >> And we use a specific seg:bus:dev.fn to force it to be page aligned. >> So we set pci_resource_page_aligned because all devices are page >> aligned now. It will return false when we call >> pci_resources_share_page() for this resource. But this return value >> will be not correct if we hot add a new device which also have >> a half-page resource which will share one page with the previous >> resource. > That's a good point, it should be documented in a comment. OK. I will do it. >>> The test itself is not entirely reassuring, I'd like some positive >>> indication that the device has been aligned, not simply that it should >>> have been and the start alignment appears that it might have happened. >>> Apparently you don't trust pci_resources_page_aligned either since you >>> still test the start alignment. >> Yes, I don't trust pci_resources_page_aligned. Because >> resource_alignment will not work in some case, e.g. >> PCI_PROBE_ONLY or IORESOURCE_PCI_FIXED is set. >> >> But I think *start* of the resource can give the positive >> indication. The resource should not share page with >> others if the start of the resource and its sibling are both >> page aligned. > If you can't trust pci_resource_page_aligned then the alignment of > start seems like it only indicates that it's likely aligned, not that > it is aligned. Sorry. I don't get your point why start of resource can't indicates it's aligned. The res->start is equal to the start of allocated PCI address. Shouldn't it mean mmio page of the resource is exclusive if res->start is page aligned and res->sibling->start - res->start > PAGE_SIZE? Thanks, Yongji Xie