Subject: Re: [PATCH v2] coresight: tmc-etr: Speed up for bounce buffer in flat mode
To: Leo Yan, Mathieu Poirier, Mike Leach, Alexander Shishkin,
    coresight@lists.linaro.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org
References: <20210710070115.462674-1-leo.yan@linaro.org>
From: Suzuki K Poulose
Date: Mon, 12 Jul 2021 10:55:32 +0100
In-Reply-To: <20210710070115.462674-1-leo.yan@linaro.org>

Leo,

On 10/07/2021 08:01, Leo Yan wrote:
> The AUX bounce buffer is allocated with dma_alloc_coherent(). In the
> low-level architecture code, e.g. for arm64, this maps the memory with
> the "Normal non-cacheable" attribute, as can be seen from the
> definition of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
>
> Because the memory mapping is non-cacheable, later accesses to the AUX
> bounce buffer are inefficient: every load instruction must go out to
> DRAM.
>
> This patch switches to allocating the pages with alloc_pages_node(),
> so the driver can access the memory through a cacheable mapping in the
> kernel linear virtual address range. Since load instructions can then
> fetch data from cache lines rather than always reading from DRAM, the
> driver's memory-copy performance improves. With the cacheable mapping,
> the driver uses dma_sync_single_for_cpu() to invalidate the cache
> lines before reading the bounce buffer, so it avoids reading stale
> trace data.
>
> Measuring the duration of tmc_update_etr_buffer() with the ftrace
> function_graph tracer shows a significant performance improvement for
> copying 4MiB of data from the bounce buffer:
>
>   # echo tmc_etr_get_data_flat_buf > set_graph_notrace   // avoid noise
>   # echo tmc_update_etr_buffer > set_graph_function
>   # echo function_graph > current_tracer
>
> Before:
>
>   # CPU  DURATION                  FUNCTION CALLS
>   # |     |   |                     |   |   |   |
>   2)               |  tmc_update_etr_buffer() {
>      ...
>   2) # 8148.320 us |  }
>
> After:
>
>   # CPU  DURATION                  FUNCTION CALLS
>   # |     |   |                     |   |   |   |
>   2)               |  tmc_update_etr_buffer() {
>      ...
>   2) # 2463.980 us |  }

That's a good speedup! Thanks for the patch. One minor comment below.

>
> Signed-off-by: Leo Yan
> ---
>
> Changes from v1:
> Set "flat_buf->daddr" to 0 when mapping the DMA region fails; dropped
> the unintended if-condition change in tmc_etr_free_flat_buf().
>
>  .../hwtracing/coresight/coresight-tmc-etr.c | 56 ++++++++++++++++---
>  1 file changed, 49 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> index acdb59e0e661..888b0f929d33 100644
> --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
> +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> @@ -21,6 +21,7 @@
>
>  struct etr_flat_buf {
>  	struct device	*dev;
> +	struct page	*pages;
>  	dma_addr_t	daddr;
>  	void		*vaddr;
>  	size_t		size;
> @@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
>  {
>  	struct etr_flat_buf *flat_buf;
>  	struct device *real_dev = drvdata->csdev->dev.parent;
> +	ssize_t	aligned_size;
>
>  	/* We cannot reuse existing pages for flat buf */
>  	if (pages)
> @@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
>  	if (!flat_buf)
>  		return -ENOMEM;
>
> -	flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
> -					     &flat_buf->daddr, GFP_KERNEL);
> -	if (!flat_buf->vaddr) {
> -		kfree(flat_buf);
> -		return -ENOMEM;
> +	aligned_size = PAGE_ALIGN(etr_buf->size);
> +	flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
> +					   get_order(aligned_size));
> +	if (!flat_buf->pages)
> +		goto fail_alloc_pages;
> +
> +	flat_buf->vaddr = page_address(flat_buf->pages);
> +	flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
> +				       aligned_size, DMA_FROM_DEVICE);
> +	if (dma_mapping_error(real_dev, flat_buf->daddr)) {
> +		flat_buf->daddr = 0;
> +		goto fail_dma_map_page;
>  	}
>
>  	flat_buf->size = etr_buf->size;
> @@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
>  	etr_buf->mode = ETR_MODE_FLAT;
>  	etr_buf->private = flat_buf;
>  	return 0;
> +
> +fail_dma_map_page:
> +	__free_pages(flat_buf->pages, get_order(aligned_size));
> +fail_alloc_pages:
> +	kfree(flat_buf);
> +	return -ENOMEM;
>  }
>
>  static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
> @@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
>
>  	if (flat_buf && flat_buf->daddr) {
>  		struct device *real_dev = flat_buf->dev->parent;
> +		ssize_t	aligned_size = PAGE_ALIGN(etr_buf->size);
>
> -		dma_free_coherent(real_dev, flat_buf->size,
> -				  flat_buf->vaddr, flat_buf->daddr);
> +		dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
> +			       DMA_FROM_DEVICE);
> +		__free_pages(flat_buf->pages, get_order(aligned_size));
>  	}
>  	kfree(flat_buf);
>  }
>
>  static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
>  {
> +	struct etr_flat_buf *flat_buf = etr_buf->private;
> +	struct device *real_dev = flat_buf->dev->parent;
> +
>  	/*
>  	 * Adjust the buffer to point to the beginning of the trace data
>  	 * and update the available trace data.
> @@ -648,6 +668,28 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
>  		etr_buf->len = etr_buf->size;
>  	else
>  		etr_buf->len = rwp - rrp;
> +
> +	if (etr_buf->offset + etr_buf->len > etr_buf->size) {
> +		int len1, len2;
> +
> +		/*
> +		 * If the trace data wrapped around, sync the AUX bounce
> +		 * buffer in two chunks: "len1" is the trace data length
> +		 * at the tail of the bounce buffer, and "len2" is the
> +		 * length from the start of the buffer after wrapping
> +		 * around.
> +		 */
> +		len1 = etr_buf->size - etr_buf->offset;
> +		len2 = etr_buf->len - len1;
> +		dma_sync_single_for_cpu(real_dev,
> +					flat_buf->daddr + etr_buf->offset,
> +					len1, DMA_FROM_DEVICE);
> +		dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
> +					len2, DMA_FROM_DEVICE);

We always start tracing at the beginning of the buffer, and the only
reason we would get a wrap-around is when the buffer is full. So you
could as well sync the entire buffer in one go:

	dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
				etr_buf->len, DMA_FROM_DEVICE);

> +	} else {
> +		dma_sync_single_for_cpu(real_dev,
> +					flat_buf->daddr + etr_buf->offset,
> +					etr_buf->len, DMA_FROM_DEVICE);
> +	}
>  }
>
>  static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
>

Either way,

Reviewed-by: Suzuki K Poulose
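
[Editorial note: for readers less familiar with the streaming DMA API,
the pattern the patch adopts reduces to the sketch below. It is a
minimal illustration under the same assumptions as the patch (the
device writes into the buffer, hence DMA_FROM_DEVICE); the demo_* names
are hypothetical and are not part of the driver.]

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Allocate cacheable pages and map them for device DMA. */
	static void *demo_buf_alloc(struct device *dev, size_t size,
				    int node, dma_addr_t *daddr)
	{
		size_t aligned = PAGE_ALIGN(size);
		struct page *pages;

		/* Cacheable pages in the kernel linear map */
		pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
					 get_order(aligned));
		if (!pages)
			return NULL;

		/* Hand the pages to the device for DMA */
		*daddr = dma_map_page(dev, pages, 0, aligned,
				      DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, *daddr)) {
			__free_pages(pages, get_order(aligned));
			return NULL;
		}
		return page_address(pages);
	}

	/* Prepare a region for CPU reads after the device wrote to it. */
	static void demo_buf_read_prepare(struct device *dev,
					  dma_addr_t daddr, size_t len)
	{
		/*
		 * Invalidate the CPU cache lines covering the region, so
		 * the CPU sees what the device wrote rather than stale
		 * cached data. Subsequent loads hit cacheable memory.
		 */
		dma_sync_single_for_cpu(dev, daddr, len, DMA_FROM_DEVICE);
	}

	/* Tear down: unmap from the device, then free the pages. */
	static void demo_buf_free(struct device *dev, void *vaddr,
				  dma_addr_t daddr, size_t size)
	{
		size_t aligned = PAGE_ALIGN(size);

		dma_unmap_page(dev, daddr, aligned, DMA_FROM_DEVICE);
		__free_pages(virt_to_page(vaddr), get_order(aligned));
	}

The trade-off versus dma_alloc_coherent() is that the CPU must
explicitly invalidate its caches before every read, but each load then
comes from cacheable memory instead of going out to DRAM, which is
where the measured speedup comes from.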