From: "Song Bao Hua (Barry Song)"
To: Mike Rapoport
Cc: "hch@lst.de", "m.szyprowski@samsung.com", "robin.murphy@arm.com",
    "will@kernel.org", "ganapatrao.kulkarni@cavium.com", "catalin.marinas@arm.com",
"akpm@linux-foundation.org" , "iommu@lists.linux-foundation.org" , "linux-arm-kernel@lists.infradead.org" , "linux-kernel@vger.kernel.org" , "Zengtao (B)" , huangdaode , Linuxarm , "Jonathan Cameron" , Nicolas Saenz Julienne , Steve Capper Subject: RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA Thread-Topic: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA Thread-Index: AQHWd69ix8OgXlSJiUCOYl4oKg5K9qlCGRQAgADS8MA= Date: Fri, 21 Aug 2020 19:14:56 +0000 Message-ID: References: <20200821113355.6140-1-song.bao.hua@hisilicon.com> <20200821113355.6140-2-song.bao.hua@hisilicon.com> <20200821142804.GR969206@linux.ibm.com> In-Reply-To: <20200821142804.GR969206@linux.ibm.com> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.126.201.113] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: Mike Rapoport [mailto:rppt@linux.ibm.com] > Sent: Saturday, August 22, 2020 2:28 AM > To: Song Bao Hua (Barry Song) > Cc: hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com; > will@kernel.org; ganapatrao.kulkarni@cavium.com; > catalin.marinas@arm.com; akpm@linux-foundation.org; > iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org; > linux-kernel@vger.kernel.org; Zengtao (B) ; > huangdaode ; Linuxarm ; > Jonathan Cameron ; Nicolas Saenz Julienne > ; Steve Capper > Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve > per-numa CMA > > On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote: > > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get > > coherent DMA buffers to save their command queues and page tables. As > > there is only one default CMA in the whole system, SMMUs on nodes other > > than node0 will get remote memory. This leads to significant latency. > > > > This patch provides per-numa CMA so that drivers like SMMU can get local > > memory. Tests show localizing CMA can decrease dma_unmap latency much. > > For instance, before this patch, SMMU on node2 has to wait for more than > > 560ns for the completion of CMD_SYNC in an empty command queue; with > this > > patch, it needs 240ns only. > > > > A positive side effect of this patch would be improving performance even > > further for those users who are worried about performance more than DMA > > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all > > drivers can get local coherent DMA buffers. 
> >
> > Cc: Jonathan Cameron
> > Cc: Christoph Hellwig
> > Cc: Marek Szyprowski
> > Cc: Will Deacon
> > Cc: Robin Murphy
> > Cc: Ganapatrao Kulkarni
> > Cc: Catalin Marinas
> > Cc: Nicolas Saenz Julienne
> > Cc: Steve Capper
> > Cc: Andrew Morton
> > Cc: Mike Rapoport
> > Signed-off-by: Barry Song
> > ---
> > -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h        |   6 ++
> >  kernel/dma/Kconfig                    |  11 ++
> >  kernel/dma/contiguous.c               | 100 ++++++++++++++++--
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> >  			altogether. For more information, see
> >  			include/linux/dma-contiguous.h
> >
> > +	pernuma_cma=nn[MG]
>
> Maybe cma_pernuma or cma_pernode?

Sounds good.

>
> > +			[ARM64,KNL]
> > +			Sets the size of kernel per-numa memory area for
> > +			contiguous memory allocations. A value of 0 disables
> > +			per-numa CMA altogether. And If this option is not
> > +			specificed, the default value is 0.
> > +			With per-numa CMA enabled, DMA users on node nid will
> > +			first try to allocate buffer from the pernuma area
> > +			which is located in node nid, if the allocation fails,
> > +			they will fallback to the global default memory area.
> > +
> >  	cmo_free_hint=	[PPC] Format: { yes | no }
> >  			Specify whether pages are marked as being inactive
> >  			when they are freed. This is used in CMO environments
> >
> > diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,17 @@ config DMA_CMA
> >  	  If unsure, say "n".
> >
> >  if DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +	bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> > +	default NUMA && ARM64
> > +	help
> > +	  Enable this option to get pernuma CMA areas so that devices like
> > +	  ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > +	  You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> > +	  on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> >
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..0383c9b86715 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +	pernuma_size_bytes = memparse(p, &p);
> > +	return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +	int nid;
> > +
> > +	if (!pernuma_size_bytes)
> > +		return;
> > +
> > +	for_each_online_node(nid) {
> > +		int ret;
> > +		char name[20];
> > +		struct cma **cma = &dma_contiguous_pernuma_area[nid];
> > +
> > +		snprintf(name, sizeof(name), "pernuma%d", nid);
> > +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +						 0, false, name, cma, nid);
> > +		if (ret) {
> > +			pr_warn("%s: reservation failed: err %d, node %d", __func__,
> > +				ret, nid);
> > +			continue;
> > +		}
> > +
> > +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> > +			 (unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> > +	}
> > +}
> > +#endif
> > +
> >  /**
> >   * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
> >   * @limit: End address of the reserved memory (optional, 0 for any).
> > @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
> >   * @size: Requested allocation size.
> >   * @gfp:  Allocation flags.
> >   *
> > - * This function allocates contiguous memory buffer for specified device. It
> > - * tries to use device specific contiguous memory area if available, or the
> > - * default global one.
> > + * tries to use device specific contiguous memory area if available, or it
> > + * tries to use per-numa cma, if the allocation fails, it will fallback to
> > + * try default global one.
> >   *
> > - * Note that it byapss one-page size of allocations from the global area as
> > - * the addresses within one page are always contiguous, so there is no need
> > - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> > + * Note that it bypass one-page size of allocations from the per-numa and
> > + * global area as the addresses within one page are always contiguous, so
> > + * there is no need to waste CMA pages for that kind; it also helps reduce
> > + * fragmentations.
> >   */
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >  {
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	int nid = dev_to_node(dev);
> > +#endif
> > +
> >  	/* CMA can be used only in the context which permits sleeping */
> >  	if (!gfpflags_allow_blocking(gfp))
> >  		return NULL;
> >  	if (dev->cma_area)
> >  		return cma_alloc_aligned(dev->cma_area, size, gfp);
> > -	if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> > +	if (size <= PAGE_SIZE)
> > +		return NULL;
> > +
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
> > +		struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> It could be that for some node reservation failedm than
> dma_contiguous_pernuma_area[nid] would be NULL.
> I'd add a fallback to another node here.

This is already handled: if dma_contiguous_pernuma_area[nid] is NULL, the code
below falls back to the default global CMA.

>
> > +		struct page *page;
> > +
> > +		if (cma) {
> > +			page = cma_alloc_aligned(cma, size, gfp);
> > +			if (page)
> > +				return page;
> > +		}
> > +	}
> > +#endif
>
> I think the selection of the area can be put in a helper funciton and
> then here we just try to allocate from the selected area. E.g.
>
> static struct cma* dma_get_cma_area(struct device *dev)
> {
> #ifdef CONFIG_DMA_PERNUMA_CMA
> 	int nid = dev_to_node(dev);
> 	struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> 	if (!cma)
> 		/* select cma from another node */ ;
>
> 	return cma;
> #else
> 	return dma_contiguous_default_area;
> #endif
> }
>

It is possible that dma_contiguous_pernuma_area[nid] is not NULL but we still
fail to get memory from it, because the area is either full or unable to
satisfy GFP_DMA(32). In that case, we still need to fall back to the default
global CMA. So the code first tries the per-numa CMA and then tries the
default global CMA; it is not picking one of the two areas, it tries both.
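To spell the intended order out, here is a rough sketch of the fallback flow
(illustration only, not the exact patch code; the function name is made up,
and the blocking, size, GFP_DMA(32) and dev->cma_area checks from the patch
are omitted):

static struct page *alloc_from_pernuma_then_global(struct device *dev,
						   size_t size, gfp_t gfp)
{
	struct page *page = NULL;

#ifdef CONFIG_DMA_PERNUMA_CMA
	int nid = dev_to_node(dev);

	/* 1st try: the per-numa area of the device's node, if it was reserved */
	if (nid != NUMA_NO_NODE && dma_contiguous_pernuma_area[nid])
		page = cma_alloc_aligned(dma_contiguous_pernuma_area[nid],
					 size, gfp);
#endif

	/*
	 * 2nd try: whenever the per-numa attempt did not return a page
	 * (area missing, exhausted, or unable to satisfy the request),
	 * fall back to the default global area.
	 */
	if (!page && dma_contiguous_default_area)
		page = cma_alloc_aligned(dma_contiguous_default_area,
					 size, gfp);

	return page;
}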
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> 	struct cma *cma;
> 	...
>
> 	cma = dma_get_cma_area(dev);
> 	if (!cma)
> 		return NULL;
>
> 	return cma_alloc_aligned(cma, size, gfp);
> }
>
> > +	if (!dma_contiguous_default_area)
> >  		return NULL;
> > +
> >  	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
> >  }
> >
> > @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >   */
> >  void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
> >  {
> > -	if (!cma_release(dev_get_cma_area(dev), page,
> > -			 PAGE_ALIGN(size) >> PAGE_SHIFT))
> > -		__free_pages(page, get_order(size));
>
> Here as well, dev_get_cma_area() can be replaced with, say
> dma_get_dev_cma_area(dev, page) that will hide the below logic.

As explained above, this won't work: the free path may also have to try more
than one area before falling back to the buddy allocator.

>
> > +	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +	/* if dev has its own cma, free page from there */
> > +	if (dev->cma_area) {
> > +		if (cma_release(dev->cma_area, page, count))
> > +			return;
> > +	} else {
> > +		/*
> > +		 * otherwise, page is from either per-numa cma or default cma
> > +		 */
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +		if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)],
> > +				page, count))
> > +			return;
> > +#endif
> > +		if (cma_release(dma_contiguous_default_area, page, count))
> > +			return;
> > +	}
> > +
> > +	/* not in any cma, free from buddy */
> > +	__free_pages(page, get_order(size));
> >  }
> >
> >  /*
> > --
> > 2.27.0
> >
> >

Thanks
Barry