Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161030AbbD2FwA (ORCPT ); Wed, 29 Apr 2015 01:52:00 -0400 Received: from ozlabs.org ([103.22.144.67]:43647 "EHLO ozlabs.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753342AbbD2FuF (ORCPT ); Wed, 29 Apr 2015 01:50:05 -0400 Date: Wed, 29 Apr 2015 15:04:08 +1000 From: David Gibson To: Alexey Kardashevskiy Cc: linuxppc-dev@lists.ozlabs.org, Benjamin Herrenschmidt , Paul Mackerras , Alex Williamson , Gavin Shan , linux-kernel@vger.kernel.org Subject: Re: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables Message-ID: <20150429050408.GS32589@voom.redhat.com> References: <1429964096-11524-1-git-send-email-aik@ozlabs.ru> <1429964096-11524-23-git-send-email-aik@ozlabs.ru> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qjXtncIm5b3tWrFJ" Content-Disposition: inline In-Reply-To: <1429964096-11524-23-git-send-email-aik@ozlabs.ru> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12301 Lines: 351 --qjXtncIm5b3tWrFJ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Apr 25, 2015 at 10:14:46PM +1000, Alexey Kardashevskiy wrote: > TCE tables might get too big in case of 4K IOMMU pages and DDW enabled > on huge guests (hundreds of GB of RAM) so the kernel might be unable to > allocate contiguous chunk of physical memory to store the TCE table. >=20 > To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE ta= bles, > up to 5 levels which splits the table into a tree of smaller subtables. >=20 > This adds multi-level TCE tables support to pnv_pci_create_table() > and pnv_pci_free_table() helpers. >=20 > Signed-off-by: Alexey Kardashevskiy > --- > Changes: > v9: > * moved from ioda2 to common powernv pci code > * fixed cleanup if allocation fails in a middle > * removed check for the size - all boundary checks happen in the calling = code > anyway > --- > arch/powerpc/include/asm/iommu.h | 2 + > arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++-- > arch/powerpc/platforms/powernv/pci.c | 94 +++++++++++++++++++++++++= ++++-- > arch/powerpc/platforms/powernv/pci.h | 4 +- > 4 files changed, 104 insertions(+), 11 deletions(-) >=20 > diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/= iommu.h > index 7e7ca0a..0f50ee2 100644 > --- a/arch/powerpc/include/asm/iommu.h > +++ b/arch/powerpc/include/asm/iommu.h > @@ -96,6 +96,8 @@ struct iommu_pool { > struct iommu_table { > unsigned long it_busno; /* Bus number this table belongs to */ > unsigned long it_size; /* Size of iommu table in entries */ > + unsigned long it_indirect_levels; > + unsigned long it_level_size; > unsigned long it_offset; /* Offset into global table */ > unsigned long it_base; /* mapped address of tce table */ > unsigned long it_index; /* which iommu table this is */ > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/pla= tforms/powernv/pci-ioda.c > index 59baa15..cc1d09c 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu= _table_group *table_group, > table_group); > struct pnv_phb *phb =3D pe->phb; > int64_t rc; > + const unsigned long size =3D tbl->it_indirect_levels ? > + tbl->it_level_size : tbl->it_size; > const __u64 start_addr =3D tbl->it_offset << tbl->it_page_shift; > const __u64 win_size =3D tbl->it_size << tbl->it_page_shift; > =20 > pe_info(pe, "Setting up window at %llx..%llx " > - "pgsize=3D0x%x tablesize=3D0x%lx\n", > + "pgsize=3D0x%x tablesize=3D0x%lx " > + "levels=3D%d levelsize=3D%x\n", > start_addr, start_addr + win_size - 1, > - 1UL << tbl->it_page_shift, tbl->it_size << 3); > + 1UL << tbl->it_page_shift, tbl->it_size << 3, > + tbl->it_indirect_levels + 1, tbl->it_level_size << 3); > =20 > tbl->it_table_group =3D &pe->table_group; > =20 > @@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_t= able_group *table_group, > rc =3D opal_pci_map_pe_dma_window(phb->opal_id, > pe->pe_number, > pe->pe_number << 1, > - 1, > + tbl->it_indirect_levels + 1, > __pa(tbl->it_base), > - tbl->it_size << 3, > + size << 3, > 1ULL << tbl->it_page_shift); > if (rc) { > pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); > @@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_p= hb *phb, > phb->ioda.m32_pci_base); > =20 > rc =3D pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, > - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); > + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, > + POWERNV_IOMMU_DEFAULT_LEVELS, tbl); > if (rc) { > pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); > return; > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platform= s/powernv/pci.c > index 6bcfad5..fc129c4 100644 > --- a/arch/powerpc/platforms/powernv/pci.c > +++ b/arch/powerpc/platforms/powernv/pci.c > @@ -46,6 +46,8 @@ > #define cfg_dbg(fmt...) do { } while(0) > //#define cfg_dbg(fmt...) printk(fmt) > =20 > +#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL)) Use the existing ALIGN_UP macro instead of creating a new one. > #ifdef CONFIG_PCI_MSI > static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type) > { > @@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops =3D { > static __be64 *pnv_tce(struct iommu_table *tbl, long idx) > { > __be64 *tmp =3D ((__be64 *)tbl->it_base); > + int level =3D tbl->it_indirect_levels; > + const long shift =3D ilog2(tbl->it_level_size); > + unsigned long mask =3D (tbl->it_level_size - 1) << (level * shift); > + > + while (level) { > + int n =3D (idx & mask) >> (level * shift); > + unsigned long tce =3D be64_to_cpu(tmp[n]); > + > + tmp =3D __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE)); > + idx &=3D ~mask; > + mask >>=3D shift; > + --level; > + } > =20 > return tmp + idx; > } > @@ -648,12 +663,18 @@ void pnv_pci_setup_iommu_table(struct iommu_table *= tbl, > } > =20 > static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift, > + unsigned levels, unsigned long limit, > unsigned long *tce_table_allocated) > { > struct page *tce_mem =3D NULL; > - __be64 *addr; > + __be64 *addr, *tmp; > unsigned order =3D max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; > unsigned long local_allocated =3D 1UL << (order + PAGE_SHIFT); > + unsigned entries =3D 1UL << (shift - 3); > + long i; > + > + if (limit =3D=3D *tce_table_allocated) > + return NULL; If this is for what I think, it seems a bit unsafe. Shouldn't it be >=3D, otherwise it could fail to trip if the limit isn't exactly a >multiple of the bottom level allocation unit. > tce_mem =3D alloc_pages_node(nid, GFP_KERNEL, order); > if (!tce_mem) { > @@ -662,14 +683,33 @@ static __be64 *pnv_alloc_tce_table_pages(int nid, u= nsigned shift, > } > addr =3D page_address(tce_mem); > memset(addr, 0, local_allocated); > - *tce_table_allocated =3D local_allocated; > + > + --levels; > + if (!levels) { > + /* Update tce_table_allocated with bottom level table size only */ > + *tce_table_allocated +=3D local_allocated; > + return addr; > + } > + > + for (i =3D 0; i < entries; ++i) { > + tmp =3D pnv_alloc_tce_table_pages(nid, shift, levels, limit, > + tce_table_allocated); Urgh.. it's a limited depth so it *might* be ok, but recursion is generally avoided in the kernel, becuase of the very limited stack size. > + if (!tmp) > + break; > + > + addr[i] =3D cpu_to_be64(__pa(tmp) | > + TCE_PCI_READ | TCE_PCI_WRITE); > + } It also seems like it would make sense for this function ti set it_indirect_levels ant it_level_size, rather than leaving it to the caller. > return addr; > } > =20 > +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long s= ize, > + unsigned level); > + > long pnv_pci_create_table(struct iommu_table_group *table_group, int nid, > __u64 bus_offset, __u32 page_shift, __u64 window_size, > - struct iommu_table *tbl) > + __u32 levels, struct iommu_table *tbl) > { > void *addr; > unsigned long tce_table_allocated =3D 0; > @@ -678,16 +718,34 @@ long pnv_pci_create_table(struct iommu_table_group = *table_group, int nid, > unsigned table_shift =3D entries_shift + 3; > const unsigned long tce_table_size =3D max(0x1000UL, 1UL << table_shift= ); > =20 > + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS)) > + return -EINVAL; > + > if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) > return -EINVAL; > =20 > + /* Adjust direct table size from window_size and levels */ > + entries_shift =3D ROUND_UP(entries_shift, levels) / levels; ROUND_UP() only works if the second parameter is a power of 2. Is that always true for levels? For division rounding up, the usual idiom is just ((a + (b - 1)) / b) > + table_shift =3D entries_shift + 3; > + table_shift =3D max_t(unsigned, table_shift, PAGE_SHIFT); Does the PAGE_SHIFT rounding make sense any more? I would have thought you'd round the level size up to page size, rather than the whole thing. > /* Allocate TCE table */ > addr =3D pnv_alloc_tce_table_pages(nid, table_shift, > - &tce_table_allocated); > + levels, tce_table_size, &tce_table_allocated); > + if (!addr) > + return -ENOMEM; > + > + if (tce_table_size !=3D tce_table_allocated) { > + pnv_free_tce_table_pages((unsigned long) addr, > + tbl->it_level_size, tbl->it_indirect_levels); > + return -ENOMEM; > + } > =20 > /* Setup linux iommu table */ > pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset, > page_shift); > + tbl->it_level_size =3D 1ULL << (table_shift - 3); > + tbl->it_indirect_levels =3D levels - 1; > =20 > pr_info("Created TCE table: window size =3D %08llx, " > "tablesize =3D %lx (%lx), start @%08llx\n", > @@ -697,12 +755,38 @@ long pnv_pci_create_table(struct iommu_table_group = *table_group, int nid, > return 0; > } > =20 > +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long s= ize, > + unsigned level) > +{ > + addr &=3D ~(TCE_PCI_READ | TCE_PCI_WRITE); > + > + if (level) { > + long i; > + u64 *tmp =3D (u64 *) addr; > + > + for (i =3D 0; i < size; ++i) { > + unsigned long hpa =3D be64_to_cpu(tmp[i]); > + > + if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE))) > + continue; > + > + pnv_free_tce_table_pages((unsigned long) __va(hpa), > + size, level - 1); > + } > + } > + > + free_pages(addr, get_order(size << 3)); > +} > + > void pnv_pci_free_table(struct iommu_table *tbl) > { > + const unsigned long size =3D tbl->it_indirect_levels ? > + tbl->it_level_size : tbl->it_size; > + > if (!tbl->it_size) > return; > =20 > - free_pages(tbl->it_base, get_order(tbl->it_size << 3)); > + pnv_free_tce_table_pages(tbl->it_base, size, tbl->it_indirect_levels); > iommu_reset_table(tbl, "pnv"); > } > =20 > diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platform= s/powernv/pci.h > index e6cbbec..3d1ff584 100644 > --- a/arch/powerpc/platforms/powernv/pci.h > +++ b/arch/powerpc/platforms/powernv/pci.h > @@ -218,9 +218,11 @@ int pnv_pci_cfg_write(struct pci_dn *pdn, > extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl, > void *tce_mem, u64 tce_size, > u64 dma_offset, unsigned page_shift); > +#define POWERNV_IOMMU_DEFAULT_LEVELS 1 > +#define POWERNV_IOMMU_MAX_LEVELS 5 > extern long pnv_pci_create_table(struct iommu_table_group *table_group, = int nid, > __u64 bus_offset, __u32 page_shift, __u64 window_size, > - struct iommu_table *tbl); > + __u32 levels, struct iommu_table *tbl); > extern void pnv_pci_free_table(struct iommu_table *tbl); > extern void pnv_pci_init_p5ioc2_hub(struct device_node *np); > extern void pnv_pci_init_ioda_hub(struct device_node *np); --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --qjXtncIm5b3tWrFJ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBAgAGBQJVQGZIAAoJEGw4ysog2bOSSbMQAK2zX8wy+dqGwdeob+6hCV49 3YxpIwnRTXtYlYVSrfEqT9r2B5xTAi2QBQ1rV+LWUY/UL0V0qq+x5kMIxv2ubU/W J+ZKVIbJHDATYOwgYaf8uO/HwhPCukMdvlJp2a5qpRgzapu/YoXOKZN5IxN17asQ Rw6whDYsaQsurOK9Wnl8MtFh9oSJ9ELElqV+FQwLI+ACJiWn9KFpE4odWLbkayue nR3mqnJLakH0x9gi2r44PIXlLUg1PeWzHZx5EzrEev7Jdb42qnifZgzc5jnST5Pp 5KhwnFGR3j59PFOWEfQissBi4XFDf60JTE1dWnlwc1QpfU/TpkAER4NMNYBfkyox GnbKPqinRcBIox8p+wZ5wYaMTQYwkqOr0ihaWEj+rf+Ejg02qxVbF8yVXdor8Sfc r+OpLeghoiwMCDkVshCDUFyyDCTxevKGYWfXzcR3WLw2L1f84h8jmJDnIDjJLYBU 3Tl7RkWvPDcY+7uYgUL/edBDq8SvI+BtzWqRX37rJguCvYHRxHb1QTq1mAQRNG9v GhKjrWgglmVe7wTD41hOvzD2OteDK0TgPhC24/Px2yg4iUeBVceJwYkT5mPYw1Oe ZIHXNBsZwtUJ/QgorAif82q44qV2ZfHVflsmE1gGzcJjNB9ZHzYed/aWbqd2Re/7 OpPJc0NzL61ANNJeDzGo =MVIW -----END PGP SIGNATURE----- --qjXtncIm5b3tWrFJ-- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/