LinuxLists.cc - [PATCH 0/4] ntb perf and ntb tool improvements

2020-02-05 17:18:55

Subject: [PATCH 0/4] ntb perf and ntb tool improvements

This patch series modifies the ntb perf code to
have separate functions for CPU and DMA transfers.
It also adds handling for -EAGAIN error code so
that we re-try sending commands later.

Fixes have been made to ntb_perf and ntb_tool
to use 'struct device' associated with 'struct
pci_dev' rather than 'struct ntb_dev'.

The patches are based on Linus' tree,

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Arindam Nath (2):
ntb_perf: refactor code for CPU and DMA transfers
ntb_perf: send command in response to EAGAIN

Sanjay R Mehta (2):
ntb_perf: pass correct struct device to dma_alloc_coherent
ntb_tool: pass correct struct device to dma_alloc_coherent

drivers/ntb/test/ntb_perf.c | 168 ++++++++++++++++++++++++++----------
drivers/ntb/test/ntb_tool.c | 6 +-
2 files changed, 126 insertions(+), 48 deletions(-)

--
2.17.1

2020-02-05 17:19:01

by Nath, Arindam

[permalink] [raw]

Subject: [PATCH 2/4] ntb_perf: send command in response to EAGAIN

perf_spad_cmd_send() and perf_msg_cmd_send() return
-EAGAIN after trying to send commands for a maximum
of MSG_TRIES re-tries. But currently there is no
handling for this error. These functions are invoked
from perf_service_work() through function pointers,
so rather than simply call these functions is not
enough. We need to make sure to invoke them again in
case of -EAGAIN. Since peer status bits were cleared
before calling these functions, we set the same status
bits before queueing the work again for later invocation.
This way we simply won't go ahead and initialize the
XLAT registers wrongfully in case sending the very first
command itself fails.

Signed-off-by: Arindam Nath <[email protected]>
---
drivers/ntb/test/ntb_perf.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
index 0e9b9efe74a4..5116655f0211 100644
--- a/drivers/ntb/test/ntb_perf.c
+++ b/drivers/ntb/test/ntb_perf.c
@@ -625,14 +625,24 @@ static void perf_service_work(struct work_struct *work)
{
struct perf_peer *peer = to_peer_service(work);

- if (test_and_clear_bit(PERF_CMD_SSIZE, &peer->sts))
- perf_cmd_send(peer, PERF_CMD_SSIZE, peer->outbuf_size);
+ if (test_and_clear_bit(PERF_CMD_SSIZE, &peer->sts)) {
+ if (perf_cmd_send(peer, PERF_CMD_SSIZE, peer->outbuf_size)
+ == -EAGAIN) {
+ set_bit(PERF_CMD_SSIZE, &peer->sts);
+ (void)queue_work(system_highpri_wq, &peer->service);
+ }
+ }

if (test_and_clear_bit(PERF_CMD_RSIZE, &peer->sts))
perf_setup_inbuf(peer);

- if (test_and_clear_bit(PERF_CMD_SXLAT, &peer->sts))
- perf_cmd_send(peer, PERF_CMD_SXLAT, peer->inbuf_xlat);
+ if (test_and_clear_bit(PERF_CMD_SXLAT, &peer->sts)) {
+ if (perf_cmd_send(peer, PERF_CMD_SXLAT, peer->inbuf_xlat)
+ == -EAGAIN) {
+ set_bit(PERF_CMD_SXLAT, &peer->sts);
+ (void)queue_work(system_highpri_wq, &peer->service);
+ }
+ }

if (test_and_clear_bit(PERF_CMD_RXLAT, &peer->sts))
perf_setup_outbuf(peer);
--
2.17.1

2020-02-05 17:19:01

by Nath, Arindam

[permalink] [raw]

Subject: [PATCH 1/4] ntb_perf: refactor code for CPU and DMA transfers

This patch creates separate function to handle CPU
and DMA transfers. Since CPU transfers use memcopy
and DMA transfers use dmaengine APIs, these changes
not only allow logical separation between the two,
but also allows someone to clearly see the difference
in the way the two are handled.

In the case of DMA, we DMA from system memory to the
memory window(MW) of NTB, which is a MMIO region, we
should not use dma_map_page() for mapping MW. The
correct way to map a MMIO region is to use
dma_map_resource(), so the code is modified
accordingly.

dma_map_resource() expects physical address of the
region to be mapped for DMA, we add a new field,
outbuf_phys_addr, to struct perf_peer, and also
another field, outbuf_dma_addr, to store the
corresponding mapped address returned by the API.

Since the MW is contiguous, rather than mapping
chunk-by-chunk, we map the entire MW before the
actual DMA transfer happens. Then for each chunk,
we simply pass offset into the mapped region and
DMA to that region. Then later, we unmap the MW
during perf_clear_test().

The above means that now we need to have different
function parameters to deal with in the case of
CPU and DMA transfers. In the case of CPU transfers,
we simply need the CPU virtual addresses for memcopy,
but in the case of DMA, we need dma_addr_t, which
will be different from CPU physical address depending
on whether IOMMU is enabled or not. Thus we now
have two separate functions, perf_copy_chunk_cpu(),
and perf_copy_chunk_dma() to take care of above
consideration.

Signed-off-by: Arindam Nath <[email protected]>
---
drivers/ntb/test/ntb_perf.c | 143 ++++++++++++++++++++++++++----------
1 file changed, 105 insertions(+), 38 deletions(-)

diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
index e9b7c2dfc730..0e9b9efe74a4 100644
--- a/drivers/ntb/test/ntb_perf.c
+++ b/drivers/ntb/test/ntb_perf.c
@@ -149,6 +149,8 @@ struct perf_peer {
u64 outbuf_xlat;
resource_size_t outbuf_size;
void __iomem *outbuf;
+ phys_addr_t outbuf_phys_addr;
+ dma_addr_t outbuf_dma_addr;

/* Inbound MW params */
dma_addr_t inbuf_xlat;
@@ -775,49 +777,41 @@ static void perf_dma_copy_callback(void *data)
wake_up(&pthr->dma_wait);
}

-static int perf_copy_chunk(struct perf_thread *pthr,
- void __iomem *dst, void *src, size_t len)
+static int perf_copy_chunk_cpu(struct perf_thread *pthr,
+ void __iomem *dst, void *src, size_t len)
+{
+ memcpy_toio(dst, src, len);
+
+ return likely(atomic_read(&pthr->perf->tsync) > 0) ? 0 : -EINTR;
+}
+
+static int perf_copy_chunk_dma(struct perf_thread *pthr,
+ dma_addr_t dst, void *src, size_t len)
{
struct dma_async_tx_descriptor *tx;
struct dmaengine_unmap_data *unmap;
struct device *dma_dev;
int try = 0, ret = 0;

- if (!use_dma) {
- memcpy_toio(dst, src, len);
- goto ret_check_tsync;
- }
-
dma_dev = pthr->dma_chan->device->dev;
-
- if (!is_dma_copy_aligned(pthr->dma_chan->device, offset_in_page(src),
- offset_in_page(dst), len))
- return -EIO;
-
- unmap = dmaengine_get_unmap_data(dma_dev, 2, GFP_NOWAIT);
+ unmap = dmaengine_get_unmap_data(dma_dev, 1, GFP_NOWAIT);
if (!unmap)
return -ENOMEM;

unmap->len = len;
unmap->addr[0] = dma_map_page(dma_dev, virt_to_page(src),
- offset_in_page(src), len, DMA_TO_DEVICE);
+ offset_in_page(src), len, DMA_TO_DEVICE);
if (dma_mapping_error(dma_dev, unmap->addr[0])) {
ret = -EIO;
goto err_free_resource;
}
unmap->to_cnt = 1;

- unmap->addr[1] = dma_map_page(dma_dev, virt_to_page(dst),
- offset_in_page(dst), len, DMA_FROM_DEVICE);
- if (dma_mapping_error(dma_dev, unmap->addr[1])) {
- ret = -EIO;
- goto err_free_resource;
- }
- unmap->from_cnt = 1;
-
do {
- tx = dmaengine_prep_dma_memcpy(pthr->dma_chan, unmap->addr[1],
- unmap->addr[0], len, DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
+ tx = dmaengine_prep_dma_memcpy(pthr->dma_chan, dst,
+ unmap->addr[0], len,
+ DMA_PREP_INTERRUPT |
+ DMA_CTRL_ACK);
if (!tx)
msleep(DMA_MDELAY);
} while (!tx && (try++ < DMA_TRIES));
@@ -833,22 +827,16 @@ static int perf_copy_chunk(struct perf_thread *pthr,

ret = dma_submit_error(dmaengine_submit(tx));
if (ret) {
- dmaengine_unmap_put(unmap);
goto err_free_resource;
}

- dmaengine_unmap_put(unmap);
-
atomic_inc(&pthr->dma_sync);
dma_async_issue_pending(pthr->dma_chan);

-ret_check_tsync:
- return likely(atomic_read(&pthr->perf->tsync) > 0) ? 0 : -EINTR;
-
err_free_resource:
dmaengine_unmap_put(unmap);

- return ret;
+ return likely(atomic_read(&pthr->perf->tsync) > 0) ? ret : -EINTR;
}

static bool perf_dma_filter(struct dma_chan *chan, void *data)
@@ -893,7 +881,7 @@ static int perf_init_test(struct perf_thread *pthr)
return 0;
}

-static int perf_run_test(struct perf_thread *pthr)
+static int perf_run_test_cpu(struct perf_thread *pthr)
{
struct perf_peer *peer = pthr->perf->test_peer;
struct perf_ctx *perf = pthr->perf;
@@ -911,10 +899,9 @@ static int perf_run_test(struct perf_thread *pthr)
flt_dst = peer->outbuf;

pthr->duration = ktime_get();
-
/* Copied field is cleared on test launch stage */
while (pthr->copied < total_size) {
- ret = perf_copy_chunk(pthr, flt_dst, flt_src, chunk_size);
+ ret = perf_copy_chunk_cpu(pthr, flt_dst, flt_src, chunk_size);
if (ret) {
dev_err(&perf->ntb->dev, "%d: Got error %d on test\n",
pthr->tidx, ret);
@@ -937,6 +924,74 @@ static int perf_run_test(struct perf_thread *pthr)
return 0;
}

+static int perf_run_test_dma(struct perf_thread *pthr)
+{
+ struct perf_peer *peer = pthr->perf->test_peer;
+ struct perf_ctx *perf = pthr->perf;
+ struct device *dma_dev;
+ dma_addr_t flt_dst, bnd_dst;
+ u64 total_size, chunk_size;
+ void *flt_src;
+ int ret = 0;
+
+ total_size = 1ULL << total_order;
+ chunk_size = 1ULL << chunk_order;
+ chunk_size = min_t(u64, peer->outbuf_size, chunk_size);
+
+ /* Map MW for DMA */
+ dma_dev = pthr->dma_chan->device->dev;
+ peer->outbuf_dma_addr = dma_map_resource(dma_dev,
+ peer->outbuf_phys_addr,
+ peer->outbuf_size,
+ DMA_FROM_DEVICE, 0);
+ if (dma_mapping_error(dma_dev, peer->outbuf_dma_addr)) {
+ dma_unmap_resource(dma_dev, peer->outbuf_dma_addr,
+ peer->outbuf_size, DMA_FROM_DEVICE, 0);
+ return -EIO;
+ }
+
+ flt_src = pthr->src;
+ bnd_dst = peer->outbuf_dma_addr + peer->outbuf_size;
+ flt_dst = peer->outbuf_dma_addr;
+
+ pthr->duration = ktime_get();
+ /* Copied field is cleared on test launch stage */
+ while (pthr->copied < total_size) {
+ ret = perf_copy_chunk_dma(pthr, flt_dst, flt_src, chunk_size);
+ if (ret) {
+ dev_err(&perf->ntb->dev, "%d: Got error %d on test\n",
+ pthr->tidx, ret);
+ return ret;
+ }
+
+ pthr->copied += chunk_size;
+
+ flt_dst += chunk_size;
+ flt_src += chunk_size;
+ if (flt_dst >= bnd_dst || flt_dst < peer->outbuf_dma_addr) {
+ flt_dst = peer->outbuf_dma_addr;
+ flt_src = pthr->src;
+ }
+
+ /* Give up CPU to give a chance for other threads to use it */
+ schedule();
+ }
+
+ return 0;
+}
+
+static int perf_run_test(struct perf_thread *pthr)
+{
+ int ret = 0;
+
+ if (!use_dma)
+ ret = perf_run_test_cpu(pthr);
+ else
+ ret = perf_run_test_dma(pthr);
+
+ return ret;
+}
+
static int perf_sync_test(struct perf_thread *pthr)
{
struct perf_ctx *perf = pthr->perf;
@@ -969,6 +1024,8 @@ static int perf_sync_test(struct perf_thread *pthr)
static void perf_clear_test(struct perf_thread *pthr)
{
struct perf_ctx *perf = pthr->perf;
+ struct perf_peer *peer = pthr->perf->test_peer;
+ struct device *dma_dev;

if (!use_dma)
goto no_dma_notify;
@@ -978,6 +1035,10 @@ static void perf_clear_test(struct perf_thread *pthr)
* We call it anyway just to be sure of the transfers completion.
*/
(void)dmaengine_terminate_sync(pthr->dma_chan);
+ /* Un-map MW */
+ dma_dev = pthr->dma_chan->device->dev;
+ dma_unmap_resource(dma_dev, peer->outbuf_dma_addr, peer->outbuf_size,
+ DMA_FROM_DEVICE, 0);

dma_release_channel(pthr->dma_chan);

@@ -1383,10 +1444,16 @@ static int perf_setup_peer_mw(struct perf_peer *peer)
if (ret)
return ret;

- peer->outbuf = devm_ioremap_wc(&perf->ntb->dev, phys_addr,
- peer->outbuf_size);
- if (!peer->outbuf)
- return -ENOMEM;
+ if (use_dma) {
+ /* For DMA to/from MW */
+ peer->outbuf_phys_addr = phys_addr;
+ } else {
+ /* For CPU read(from)/write(to) MW */
+ peer->outbuf = devm_ioremap_wc(&perf->ntb->dev, phys_addr,
+ peer->outbuf_size);
+ if (!peer->outbuf)
+ return -ENOMEM;
+ }

if (max_mw_size && peer->outbuf_size > max_mw_size) {
peer->outbuf_size = max_mw_size;
--
2.17.1

2020-02-05 17:19:24

by Nath, Arindam

[permalink] [raw]

Subject: [PATCH 4/4] ntb_tool: pass correct struct device to dma_alloc_coherent

From: Sanjay R Mehta <[email protected]>

Currently, ntb->dev is passed to dma_alloc_coherent
and dma_free_coherent calls. The returned dma_addr_t
is the CPU physical address. This works fine as long
as IOMMU is disabled. But when IOMMU is enabled, we
need to make sure that IOVA is returned for dma_addr_t.
So the correct way to achieve this is by changing the
first parameter of dma_alloc_coherent() as ntb->pdev->dev
instead.

Fixes: 5648e56 ("NTB: ntb_perf: Add full multi-port NTB API support")
Signed-off-by: Sanjay R Mehta <[email protected]>
Signed-off-by: Arindam Nath <[email protected]>
---
drivers/ntb/test/ntb_tool.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
index d592c0ffbd19..025747c1568e 100644
--- a/drivers/ntb/test/ntb_tool.c
+++ b/drivers/ntb/test/ntb_tool.c
@@ -590,7 +590,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int pidx, int widx,
inmw->size = min_t(resource_size_t, req_size, size);
inmw->size = round_up(inmw->size, addr_align);
inmw->size = round_up(inmw->size, size_align);
- inmw->mm_base = dma_alloc_coherent(&tc->ntb->dev, inmw->size,
+ inmw->mm_base = dma_alloc_coherent(&tc->ntb->pdev->dev, inmw->size,
&inmw->dma_base, GFP_KERNEL);
if (!inmw->mm_base)
return -ENOMEM;
@@ -612,7 +612,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int pidx, int widx,
return 0;

err_free_dma:
- dma_free_coherent(&tc->ntb->dev, inmw->size, inmw->mm_base,
+ dma_free_coherent(&tc->ntb->pdev->dev, inmw->size, inmw->mm_base,
inmw->dma_base);
inmw->mm_base = NULL;
inmw->dma_base = 0;
@@ -629,7 +629,7 @@ static void tool_free_mw(struct tool_ctx *tc, int pidx, int widx)

if (inmw->mm_base != NULL) {
ntb_mw_clear_trans(tc->ntb, pidx, widx);
- dma_free_coherent(&tc->ntb->dev, inmw->size,
+ dma_free_coherent(&tc->ntb->pdev->dev, inmw->size,
inmw->mm_base, inmw->dma_base);
}

--
2.17.1

2020-02-05 17:20:40

by Nath, Arindam

[permalink] [raw]

Subject: [PATCH 3/4] ntb_perf: pass correct struct device to dma_alloc_coherent

From: Sanjay R Mehta <[email protected]>

Currently, ntb->dev is passed to dma_alloc_coherent
and dma_free_coherent calls. The returned dma_addr_t
is the CPU physical address. This works fine as long
as IOMMU is disabled. But when IOMMU is enabled, we
need to make sure that IOVA is returned for dma_addr_t.
So the correct way to achieve this is by changing the
first parameter of dma_alloc_coherent() as ntb->pdev->dev
instead.

Fixes: 5648e56 ("NTB: ntb_perf: Add full multi-port NTB API support")
Signed-off-by: Sanjay R Mehta <[email protected]>
Signed-off-by: Arindam Nath <[email protected]>
---
drivers/ntb/test/ntb_perf.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
index 5116655f0211..65501f24b742 100644
--- a/drivers/ntb/test/ntb_perf.c
+++ b/drivers/ntb/test/ntb_perf.c
@@ -558,7 +558,7 @@ static void perf_free_inbuf(struct perf_peer *peer)
return;

(void)ntb_mw_clear_trans(peer->perf->ntb, peer->pidx, peer->gidx);
- dma_free_coherent(&peer->perf->ntb->dev, peer->inbuf_size,
+ dma_free_coherent(&peer->perf->ntb->pdev->dev, peer->inbuf_size,
peer->inbuf, peer->inbuf_xlat);
peer->inbuf = NULL;
}
@@ -587,8 +587,9 @@ static int perf_setup_inbuf(struct perf_peer *peer)

perf_free_inbuf(peer);

- peer->inbuf = dma_alloc_coherent(&perf->ntb->dev, peer->inbuf_size,
- &peer->inbuf_xlat, GFP_KERNEL);
+ peer->inbuf = dma_alloc_coherent(&perf->ntb->pdev->dev,
+ peer->inbuf_size, &peer->inbuf_xlat,
+ GFP_KERNEL);
if (!peer->inbuf) {
dev_err(&perf->ntb->dev, "Failed to alloc inbuf of %pa\n",
&peer->inbuf_size);
--
2.17.1

2020-02-05 18:46:21

by Logan Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH 3/4] ntb_perf: pass correct struct device to dma_alloc_coherent

On 2020-02-05 10:16 a.m., Arindam Nath wrote:
> From: Sanjay R Mehta <[email protected]>
>
> Currently, ntb->dev is passed to dma_alloc_coherent
> and dma_free_coherent calls. The returned dma_addr_t
> is the CPU physical address. This works fine as long
> as IOMMU is disabled. But when IOMMU is enabled, we
> need to make sure that IOVA is returned for dma_addr_t.
> So the correct way to achieve this is by changing the
> first parameter of dma_alloc_coherent() as ntb->pdev->dev
> instead.
>
> Fixes: 5648e56 ("NTB: ntb_perf: Add full multi-port NTB API support")
> Signed-off-by: Sanjay R Mehta <[email protected]>
> Signed-off-by: Arindam Nath <[email protected]>

Ugh, this has been fixed repeatedly and independently by a number of
people. I sent a fix[1] in more than a year ago and Sanjay repeated the
effort a couple months ago[2].

I have the same feed back for you that I had for him: once we fix the
bug we should also go in and remove the now unnecessary
dma_coerce_mask_and_coherent() calls in the drivers at the same time
seeing it no longer makes any sense. My patch did this already.

Thanks,

Logan

[1] https://lore.kernel.org/lkml/[email protected]/
[2]
https://lore.kernel.org/lkml/[email protected]/

2020-02-07 08:42:43

by Nath, Arindam

[permalink] [raw]

Subject: RE: [PATCH 3/4] ntb_perf: pass correct struct device to dma_alloc_coherent

[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Logan Gunthorpe <[email protected]>
> Sent: Thursday, February 6, 2020 00:14
> To: Nath, Arindam <[email protected]>; Jon Mason
> <[email protected]>; Dave Jiang <[email protected]>; Allen Hubbe
> <[email protected]>; Mehta, Sanju <[email protected]>
> Cc: [email protected]; [email protected]
> Subject: Re: [PATCH 3/4] ntb_perf: pass correct struct device to
> dma_alloc_coherent
>
>
>
> On 2020-02-05 10:16 a.m., Arindam Nath wrote:
> > From: Sanjay R Mehta <[email protected]>
> >
> > Currently, ntb->dev is passed to dma_alloc_coherent
> > and dma_free_coherent calls. The returned dma_addr_t
> > is the CPU physical address. This works fine as long
> > as IOMMU is disabled. But when IOMMU is enabled, we
> > need to make sure that IOVA is returned for dma_addr_t.
> > So the correct way to achieve this is by changing the
> > first parameter of dma_alloc_coherent() as ntb->pdev->dev
> > instead.
> >
> > Fixes: 5648e56 ("NTB: ntb_perf: Add full multi-port NTB API support")
> > Signed-off-by: Sanjay R Mehta <[email protected]>
> > Signed-off-by: Arindam Nath <[email protected]>
>
> Ugh, this has been fixed repeatedly and independently by a number of
> people. I sent a fix[1] in more than a year ago and Sanjay repeated the
> effort a couple months ago[2].
>
> I have the same feed back for you that I had for him: once we fix the
> bug we should also go in and remove the now unnecessary
> dma_coerce_mask_and_coherent() calls in the drivers at the same time
> seeing it no longer makes any sense. My patch did this already.

Thanks Logan. I will include your patches in my next version of the set,
and mention you in the "From" tag.

Arindam

>
> Thanks,
>
> Logan
>
> [1]
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> kernel.org%2Flkml%2F20190109192233.5752-3-
> logang%40deltatee.com%2F&data=02%7C01%7Carindam.nath%40amd.
> com%7Cabc7298c86c54b82db1a08d7aa6b66aa%7C3dd8961fe4884e608e11a82
> d994e183d%7C0%7C0%7C637165251004193969&sdata=52DOTHpKjseZjv
> E6jmWWVVLQLeRiaykbVTQipGQbLT0%3D&reserved=0
> [2]
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> kernel.org%2Flkml%2F1575983255-70377-1-git-send-email-
> Sanju.Mehta%40amd.com%2F&data=02%7C01%7Carindam.nath%40am
> d.com%7Cabc7298c86c54b82db1a08d7aa6b66aa%7C3dd8961fe4884e608e11a
> 82d994e183d%7C0%7C0%7C637165251004193969&sdata=RFR%2BLFp5a
> ON1MA4Erx4soqL9pLHc%2BLWVNuLY0%2B0zcbo%3D&reserved=0