2014-07-08 08:26:36

by Alexandre Courbot

Subject: [PATCH v4 0/6] drm: nouveau: memory coherency on ARM

Another revision of this patchset, which is critical for GK20A to operate.

Previous attempts were exclusively using either TTM's regular page allocator or
the DMA API one. Both have their advantages and drawbacks: the page allocator is
fast but requires explicit synchronization on non-coherent architectures,
whereas the DMA allocator always returns coherent memory, but is also slower,
creates a permanent kernel mapping, and is more constrained as to which memory
it can use.

This version attempts to use the best-suited allocator for each buffer's
use-case:
- buffers that are passed to user-space can explicitly be synced during their
validation and preparation for CPU access, as previously shown by Lucas
(http://lists.freedesktop.org/archives/nouveau/2013-August/014029.html). For
these, we don't mind if the memory is not coherent and prefer to use the page
allocator.
- buffers that are used by the kernel, typically fences and GPFIFO buffers, are
accessed rarely and thus should not trigger a costly flush or cache
invalidation. For these, we want to guarantee coherent access and use the DMA
API if necessary.

This series attempts to implement this behavior by allowing the
TTM_PL_FLAG_UNCACHED flag to be passed to nouveau_bo_new(). On coherent
architectures this flag is a no-op; on non-coherent architectures, it will
force the creation of a coherent buffer using the DMA API.
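The allocator choice described above can be sketched as a small standalone model. The flag and helper names (TTM_PL_FLAG_UNCACHED, nv_device_is_cpu_coherent()) follow the series, but the function below is only an illustration of the decision logic, not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value; the real flag is defined by TTM. */
#define TTM_PL_FLAG_UNCACHED 0x1

/*
 * On a CPU-coherent architecture the flag is a no-op: the regular
 * (fast) page allocator is always safe.  On a non-coherent one, the
 * flag forces the DMA-API allocator so the buffer is coherent without
 * explicit syncs.
 */
static bool use_dma_allocator(bool cpu_coherent, unsigned int flags)
{
	return !cpu_coherent && (flags & TTM_PL_FLAG_UNCACHED);
}
```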

Several fixes and changes were necessary to enable this behavior:
- CPU addresses of DMA-allocated BOs must be made visible (patch 1) so the
coherent mapping can be used by drivers
- The DMA-sync functions are required for BOs populated using the page allocator
(patch 4). Pages need to be mapped to the device using the correct API if we
are to call the sync functions (patch 2). Additionally, we need to understand
whether we are on a CPU-coherent architecture (patch 3).
- Coherent BOs need to be detected by Nouveau so their coherent kernel mapping
can be used instead of creating a new one (patch 5).
- Finally, buffers that are used by the kernel should be requested to be
coherent (patch 6).

Changes since v3:
- Only use the DMA allocator for BOs that strictly require to be coherent
- Fixed the way pages are mapped to the GPU on platform devices
- Thoroughly checked with CONFIG_DMA_API_DEBUG that there were no API violations

Alexandre Courbot (6):
drm/ttm: expose CPU address of DMA-allocated pages
drm/nouveau: map pages using DMA API on platform devices
drm/nouveau: introduce nv_device_is_cpu_coherent()
drm/nouveau: synchronize BOs when required
drm/nouveau: implement explicitly coherent BOs
drm/nouveau: allocate GPFIFOs and fences coherently

drivers/gpu/drm/nouveau/core/engine/device/base.c | 14 ++-
drivers/gpu/drm/nouveau/core/include/core/device.h | 3 +
drivers/gpu/drm/nouveau/nouveau_bo.c | 132 +++++++++++++++++++--
drivers/gpu/drm/nouveau/nouveau_bo.h | 3 +
drivers/gpu/drm/nouveau/nouveau_chan.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++
drivers/gpu/drm/nouveau/nv84_fence.c | 4 +-
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 2 +
drivers/gpu/drm/ttm/ttm_tt.c | 6 +-
include/drm/ttm/ttm_bo_driver.h | 2 +
10 files changed, 167 insertions(+), 13 deletions(-)

--
2.0.0


2014-07-08 08:26:50

by Alexandre Courbot

Subject: [PATCH v4 4/6] drm/nouveau: synchronize BOs when required

On architectures for which access to GPU memory is non-coherent,
caches need to be flushed and invalidated explicitly when BO control
changes between CPU and GPU.

This patch adds buffer synchronization functions which invoke the
correct API (PCI or DMA) to ensure synchronization is effective.

Based on the TTM DMA cache helper patches by Lucas Stach.

Signed-off-by: Lucas Stach <[email protected]>
Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_bo.c | 56 +++++++++++++++++++++++++++++++++++
drivers/gpu/drm/nouveau/nouveau_bo.h | 2 ++
drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
3 files changed, 70 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 67e9e8e2e2ec..47e4e8886769 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
ttm_bo_kunmap(&nvbo->kmap);
}

+void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+ struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+ struct nouveau_device *device = nouveau_dev(drm->dev);
+ struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ int i;
+
+ if (!ttm_dma)
+ return;
+
+ if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
+ return;
+
+ if (nv_device_is_pci(device)) {
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ pci_dma_sync_single_for_device(device->pdev,
+ ttm_dma->dma_address[i], PAGE_SIZE,
+ PCI_DMA_TODEVICE);
+ } else {
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ dma_sync_single_for_device(nv_device_base(device),
+ ttm_dma->dma_address[i], PAGE_SIZE,
+ DMA_TO_DEVICE);
+ }
+}
+
+void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+ struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+ struct nouveau_device *device = nouveau_dev(drm->dev);
+ struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ int i;
+
+ if (!ttm_dma)
+ return;
+
+ if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
+ return;
+
+ if (nv_device_is_pci(device)) {
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ pci_dma_sync_single_for_cpu(device->pdev,
+ ttm_dma->dma_address[i], PAGE_SIZE,
+ PCI_DMA_FROMDEVICE);
+ } else {
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ dma_sync_single_for_cpu(nv_device_base(device),
+ ttm_dma->dma_address[i], PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ }
+}
+
int
nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
bool no_wait_gpu)
@@ -418,6 +472,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
if (!ttm)
return ret;

+ nouveau_bo_sync_for_device(nvbo);
+
device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
nv_wr32(device, 0x70004, 0x00000001);
if (!nv_wait(device, 0x070004, 0x00000001, 0x00000000))
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index ff17c1f432fc..fa42298d2dca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -81,6 +81,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
int nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
bool no_wait_gpu);
+void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
+void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);

struct nouveau_vma *
nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index c90c0dc0afe8..08829a720891 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -896,6 +896,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
spin_lock(&nvbo->bo.bdev->fence_lock);
ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
spin_unlock(&nvbo->bo.bdev->fence_lock);
+ nouveau_bo_sync_for_cpu(nvbo);
drm_gem_object_unreference_unlocked(gem);
return ret;
}
@@ -904,6 +905,17 @@ int
nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
struct drm_file *file_priv)
{
+ struct drm_nouveau_gem_cpu_fini *req = data;
+ struct drm_gem_object *gem;
+ struct nouveau_bo *nvbo;
+
+ gem = drm_gem_object_lookup(dev, file_priv, req->handle);
+ if (!gem)
+ return -ENOENT;
+ nvbo = nouveau_gem_object(gem);
+
+ nouveau_bo_sync_for_device(nvbo);
+ drm_gem_object_unreference_unlocked(gem);
return 0;
}

--
2.0.0

2014-07-08 08:26:54

by Alexandre Courbot

Subject: [PATCH v4 6/6] drm/nouveau: allocate GPFIFOs and fences coherently

Specify TTM_PL_FLAG_UNCACHED when allocating GPFIFOs and fences to
allow them to be safely accessed by the kernel without being synced
on non-coherent architectures.

Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_chan.c | 2 +-
drivers/gpu/drm/nouveau/nv84_fence.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_chan.c b/drivers/gpu/drm/nouveau/nouveau_chan.c
index ccb6b452d6d0..155b1b192676 100644
--- a/drivers/gpu/drm/nouveau/nouveau_chan.c
+++ b/drivers/gpu/drm/nouveau/nouveau_chan.c
@@ -110,7 +110,7 @@ nouveau_channel_prep(struct nouveau_drm *drm, struct nouveau_cli *cli,
chan->handle = handle;

/* allocate memory for dma push buffer */
- target = TTM_PL_FLAG_TT;
+ target = TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED;
if (nouveau_vram_pushbuf)
target = TTM_PL_FLAG_VRAM;

diff --git a/drivers/gpu/drm/nouveau/nv84_fence.c b/drivers/gpu/drm/nouveau/nv84_fence.c
index 9fd475c89820..b5d6737b6b8d 100644
--- a/drivers/gpu/drm/nouveau/nv84_fence.c
+++ b/drivers/gpu/drm/nouveau/nv84_fence.c
@@ -257,8 +257,8 @@ nv84_fence_create(struct nouveau_drm *drm)

if (ret == 0)
ret = nouveau_bo_new(drm->dev, 16 * (pfifo->max + 1), 0,
- TTM_PL_FLAG_TT, 0, 0, NULL,
- &priv->bo_gart);
+ TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED, 0,
+ 0, NULL, &priv->bo_gart);
if (ret == 0) {
ret = nouveau_bo_pin(priv->bo_gart, TTM_PL_FLAG_TT);
if (ret == 0) {
--
2.0.0

2014-07-08 08:27:55

by Alexandre Courbot

Subject: [PATCH v4 5/6] drm/nouveau: implement explicitly coherent BOs

Allow nouveau_bo_new() to recognize the TTM_PL_FLAG_UNCACHED flag, which
means that we want the allocated BO to be perfectly coherent between the
CPU and GPU. This is useful on non-coherent architectures for which we
do not want to manually sync some rarely-accessed buffers: typically,
fences and pushbuffers.

A TTM BO allocated with the TTM_PL_FLAG_UNCACHED flag on a non-coherent
architecture will be populated using the DMA API, and accesses to it will be
performed through the coherent mapping returned by dma_alloc_coherent().

Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_bo.c | 76 ++++++++++++++++++++++++++++++++----
drivers/gpu/drm/nouveau/nouveau_bo.h | 1 +
2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 47e4e8886769..23a29adfabf0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -219,6 +219,9 @@ nouveau_bo_new(struct drm_device *dev, int size, int align,
nvbo->tile_flags = tile_flags;
nvbo->bo.bdev = &drm->ttm.bdev;

+ if (!nv_device_is_cpu_coherent(nouveau_dev(dev)))
+ nvbo->force_coherent = flags & TTM_PL_FLAG_UNCACHED;
+
nvbo->page_shift = 12;
if (drm->client.base.vm) {
if (!(flags & TTM_PL_FLAG_TT) && size > 256 * 1024)
@@ -289,8 +292,9 @@ void
nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
{
struct ttm_placement *pl = &nvbo->placement;
- uint32_t flags = TTM_PL_MASK_CACHING |
- (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
+ uint32_t flags = (nvbo->force_coherent ? TTM_PL_FLAG_UNCACHED :
+ TTM_PL_MASK_CACHING) |
+ (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);

pl->placement = nvbo->placements;
set_placement_list(nvbo->placements, &pl->num_placement,
@@ -390,7 +394,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
if (ret)
return ret;

- ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages, &nvbo->kmap);
+ /*
+ * TTM buffers allocated using the DMA API already have a mapping, let's
+ * use it instead.
+ */
+ if (!nvbo->force_coherent)
+ ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages,
+ &nvbo->kmap);
+
ttm_bo_unreserve(&nvbo->bo);
return ret;
}
@@ -398,7 +409,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
void
nouveau_bo_unmap(struct nouveau_bo *nvbo)
{
- if (nvbo)
+ if (!nvbo)
+ return;
+
+ /*
+ * TTM buffers allocated using the DMA API already had a coherent
+ * mapping which we used, no need to unmap.
+ */
+ if (!nvbo->force_coherent)
ttm_bo_kunmap(&nvbo->kmap);
}

@@ -482,12 +500,36 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
return 0;
}

+static inline void *
+_nouveau_bo_mem_index(struct nouveau_bo *nvbo, unsigned index, void *mem, u8 sz)
+{
+ struct ttm_dma_tt *dma_tt;
+ u8 *m = mem;
+
+ index *= sz;
+
+ if (m) {
+ /* kmap'd address, return the corresponding offset */
+ m += index;
+ } else {
+ /* DMA-API mapping, lookup the right address */
+ dma_tt = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ m = dma_tt->cpu_address[index / PAGE_SIZE];
+ m += index % PAGE_SIZE;
+ }
+
+ return m;
+}
+#define nouveau_bo_mem_index(o, i, m) _nouveau_bo_mem_index(o, i, m, sizeof(*m))
+
u16
nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
{
bool is_iomem;
u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
return ioread16_native((void __force __iomem *)mem);
else
@@ -499,7 +541,9 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
{
bool is_iomem;
u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
iowrite16_native(val, (void __force __iomem *)mem);
else
@@ -511,7 +555,9 @@ nouveau_bo_rd32(struct nouveau_bo *nvbo, unsigned index)
{
bool is_iomem;
u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
return ioread32_native((void __force __iomem *)mem);
else
@@ -523,7 +569,9 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
{
bool is_iomem;
u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
iowrite32_native(val, (void __force __iomem *)mem);
else
@@ -1426,6 +1474,14 @@ nouveau_ttm_tt_populate(struct ttm_tt *ttm)
device = nv_device(drm->device);
dev = drm->dev;

+ /*
+ * Objects matching this condition have been marked as force_coherent,
+ * so use the DMA API for them.
+ */
+ if (!nv_device_is_cpu_coherent(device) &&
+ ttm->caching_state == tt_uncached)
+ return ttm_dma_populate(ttm_dma, dev->dev);
+
#if __OS_HAS_AGP
if (drm->agp.stat == ENABLED) {
return ttm_agp_tt_populate(ttm);
@@ -1476,6 +1532,10 @@ nouveau_ttm_tt_unpopulate(struct ttm_tt *ttm)
device = nv_device(drm->device);
dev = drm->dev;

+ if (!nv_device_is_cpu_coherent(device) &&
+ ttm->caching_state == tt_uncached)
+ ttm_dma_unpopulate(ttm_dma, dev->dev);
+
#if __OS_HAS_AGP
if (drm->agp.stat == ENABLED) {
ttm_agp_tt_unpopulate(ttm);
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index fa42298d2dca..9a111b92fb34 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -11,6 +11,7 @@ struct nouveau_bo {
u32 valid_domains;
u32 placements[3];
u32 busy_placements[3];
+ bool force_coherent;
struct ttm_bo_kmap_obj kmap;
struct list_head head;

--
2.0.0
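The page/offset arithmetic in the _nouveau_bo_mem_index() helper from the patch above can be modeled in isolation: an element index is scaled to a byte offset, then split into the page holding it and the offset within that page. The 4096-byte page size below is an assumption for the sketch:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096 /* assumed page size for this sketch */

struct mem_index {
	size_t page;   /* index into the per-page cpu_address[] array */
	size_t offset; /* byte offset within that page */
};

/*
 * Mirrors the DMA-API branch of _nouveau_bo_mem_index(): scale the
 * element index by the element size, then split the byte offset into
 * a page number and an intra-page offset.
 */
static struct mem_index bo_mem_index(size_t index, size_t elem_size)
{
	size_t byte = index * elem_size;
	struct mem_index m = { byte / MODEL_PAGE_SIZE, byte % MODEL_PAGE_SIZE };
	return m;
}
```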

2014-07-08 08:28:39

by Alexandre Courbot

Subject: [PATCH v4 3/6] drm/nouveau: introduce nv_device_is_cpu_coherent()

Add a function allowing us to know whether a device is CPU-coherent,
i.e. accesses performed by the CPU on GPU-mapped buffers will
be immediately visible on the GPU side and vice-versa.

For now, a device is considered to be coherent if it uses the PCI bus on
a non-ARM architecture.

Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/nouveau/core/engine/device/base.c | 6 ++++++
drivers/gpu/drm/nouveau/core/include/core/device.h | 3 +++
2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
index e4e9e64988fe..23c7061aac3c 100644
--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
@@ -520,6 +520,12 @@ nv_device_get_irq(struct nouveau_device *device, bool stall)
}
}

+bool
+nv_device_is_cpu_coherent(struct nouveau_device *device)
+{
+ return (!IS_ENABLED(CONFIG_ARM) && nv_device_is_pci(device));
+}
+
static struct nouveau_oclass
nouveau_device_oclass = {
.handle = NV_ENGINE(DEVICE, 0x00),
diff --git a/drivers/gpu/drm/nouveau/core/include/core/device.h b/drivers/gpu/drm/nouveau/core/include/core/device.h
index a8a9a9cf16cb..1f9d5d87cf06 100644
--- a/drivers/gpu/drm/nouveau/core/include/core/device.h
+++ b/drivers/gpu/drm/nouveau/core/include/core/device.h
@@ -171,4 +171,7 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr);
int
nv_device_get_irq(struct nouveau_device *device, bool stall);

+bool
+nv_device_is_cpu_coherent(struct nouveau_device *device);
+
#endif
--
2.0.0

2014-07-08 08:26:41

by Alexandre Courbot

Subject: [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

page_to_phys() is not the correct way to obtain the DMA address of a
buffer on a non-PCI system. Use the DMA API functions for this, which
are portable and will allow us to use other DMA API functions for
buffer synchronization.

Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
index 18c8c7245b73..e4e9e64988fe 100644
--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
@@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
if (pci_dma_mapping_error(device->pdev, ret))
ret = 0;
} else {
- ret = page_to_phys(page);
+ ret = dma_map_page(&device->platformdev->dev, page, 0,
+ PAGE_SIZE, DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(&device->platformdev->dev, ret))
+ ret = 0;
}

return ret;
@@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
if (nv_device_is_pci(device))
pci_unmap_page(device->pdev, addr, PAGE_SIZE,
PCI_DMA_BIDIRECTIONAL);
+ else
+ dma_unmap_page(&device->platformdev->dev, addr,
+ PAGE_SIZE, DMA_BIDIRECTIONAL);
}

int
--
2.0.0

2014-07-08 08:31:56

by Alexandre Courbot

Subject: [PATCH v4 1/6] drm/ttm: expose CPU address of DMA-allocated pages

Pages allocated using the DMA API have a coherent memory mapping. Make
this mapping visible to drivers so they can decide to use it instead of
creating their own redundant one.

Signed-off-by: Alexandre Courbot <[email protected]>
---
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 2 ++
drivers/gpu/drm/ttm/ttm_tt.c | 6 +++++-
include/drm/ttm/ttm_bo_driver.h | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f69839..0301fac5badd 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -847,6 +847,7 @@ static int ttm_dma_pool_get_pages(struct dma_pool *pool,
if (count) {
d_page = list_first_entry(&pool->free_list, struct dma_page, page_list);
ttm->pages[index] = d_page->p;
+ ttm_dma->cpu_address[index] = d_page->vaddr;
ttm_dma->dma_address[index] = d_page->dma;
list_move_tail(&d_page->page_list, &ttm_dma->pages_list);
r = 0;
@@ -978,6 +979,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
INIT_LIST_HEAD(&ttm_dma->pages_list);
for (i = 0; i < ttm->num_pages; i++) {
ttm->pages[i] = NULL;
+ ttm_dma->cpu_address[i] = 0;
ttm_dma->dma_address[i] = 0;
}

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75f319090043..341594ede596 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -58,6 +58,8 @@ static void ttm_dma_tt_alloc_page_directory(struct ttm_dma_tt *ttm)
ttm->ttm.pages = drm_calloc_large(ttm->ttm.num_pages, sizeof(void*));
ttm->dma_address = drm_calloc_large(ttm->ttm.num_pages,
sizeof(*ttm->dma_address));
+ ttm->cpu_address = drm_calloc_large(ttm->ttm.num_pages,
+ sizeof(*ttm->cpu_address));
}

#ifdef CONFIG_X86
@@ -228,7 +230,7 @@ int ttm_dma_tt_init(struct ttm_dma_tt *ttm_dma, struct ttm_bo_device *bdev,

INIT_LIST_HEAD(&ttm_dma->pages_list);
ttm_dma_tt_alloc_page_directory(ttm_dma);
- if (!ttm->pages || !ttm_dma->dma_address) {
+ if (!ttm->pages || !ttm_dma->dma_address || !ttm_dma->cpu_address) {
ttm_tt_destroy(ttm);
pr_err("Failed allocating page table\n");
return -ENOMEM;
@@ -243,6 +245,8 @@ void ttm_dma_tt_fini(struct ttm_dma_tt *ttm_dma)

drm_free_large(ttm->pages);
ttm->pages = NULL;
+ drm_free_large(ttm_dma->cpu_address);
+ ttm_dma->cpu_address = NULL;
drm_free_large(ttm_dma->dma_address);
ttm_dma->dma_address = NULL;
}
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index a5183da3ef92..7d30f0666d24 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -133,6 +133,7 @@ struct ttm_tt {
* struct ttm_dma_tt
*
* @ttm: Base ttm_tt struct.
+ * @cpu_address: The CPU address of the pages
* @dma_address: The DMA (bus) addresses of the pages
* @pages_list: used by some page allocation backend
*
@@ -142,6 +143,7 @@ struct ttm_tt {
*/
struct ttm_dma_tt {
struct ttm_tt ttm;
+ void **cpu_address;
dma_addr_t *dma_address;
struct list_head pages_list;
};
--
2.0.0

2014-07-10 12:58:43

by Daniel Vetter

Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
> page_to_phys() is not the correct way to obtain the DMA address of a
> buffer on a non-PCI system. Use the DMA API functions for this, which
> are portable and will allow us to use other DMA API functions for
> buffer synchronization.
>
> Signed-off-by: Alexandre Courbot <[email protected]>
> ---
> drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> index 18c8c7245b73..e4e9e64988fe 100644
> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
> if (pci_dma_mapping_error(device->pdev, ret))
> ret = 0;
> } else {
> - ret = page_to_phys(page);
> + ret = dma_map_page(&device->platformdev->dev, page, 0,
> + PAGE_SIZE, DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(&device->platformdev->dev, ret))
> + ret = 0;
> }
>
> return ret;
> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
> if (nv_device_is_pci(device))
> pci_unmap_page(device->pdev, addr, PAGE_SIZE,
> PCI_DMA_BIDIRECTIONAL);

pci_map/unmap alias to dma_map/unmap when called on the underlying struct
device embedded in pci_device (like for platform drivers). Dunno whether
it's worth tracking a pointer to the struct device directly and always
calling dma_map/unmap.

Just drive-by comment since I'm interested in how you solve this - i915
has similar fun with buffer sharing and coherent and non-coherent
platforms. Although we don't have fun with pci and non-pci based
platforms.
-Daniel

> + else
> + dma_unmap_page(&device->platformdev->dev, addr,
> + PAGE_SIZE, DMA_BIDIRECTIONAL);
> }
>
> int
> --
> 2.0.0
>
> _______________________________________________
> Nouveau mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/nouveau

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2014-07-10 13:04:43

by Daniel Vetter

Subject: Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required

On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
> On architectures for which access to GPU memory is non-coherent,
> caches need to be flushed and invalidated explicitly when BO control
> changes between CPU and GPU.
>
> This patch adds buffer synchronization functions which invoke the
> correct API (PCI or DMA) to ensure synchronization is effective.
>
> Based on the TTM DMA cache helper patches by Lucas Stach.
>
> Signed-off-by: Lucas Stach <[email protected]>
> Signed-off-by: Alexandre Courbot <[email protected]>
> ---
> drivers/gpu/drm/nouveau/nouveau_bo.c | 56 +++++++++++++++++++++++++++++++++++
> drivers/gpu/drm/nouveau/nouveau_bo.h | 2 ++
> drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
> 3 files changed, 70 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 67e9e8e2e2ec..47e4e8886769 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
> ttm_bo_kunmap(&nvbo->kmap);
> }
>
> +void
> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
> +{
> + struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> + struct nouveau_device *device = nouveau_dev(drm->dev);
> + struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> + int i;
> +
> + if (!ttm_dma)
> + return;
> +
> + if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> + return;

Is the is_cpu_coherent check really required? On coherent platforms the
sync_for_foo should be a noop. It's the dma api's job to encapsulate this
knowledge so that drivers can be blissfully ignorant. The explicit
is_coherent check makes this a bit leaky. And the same comment as above: the
underlying bus-specific dma-mapping functions are identical.
-Daniel

> +
> + if (nv_device_is_pci(device)) {
> + for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> + pci_dma_sync_single_for_device(device->pdev,
> + ttm_dma->dma_address[i], PAGE_SIZE,
> + PCI_DMA_TODEVICE);
> + } else {
> + for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> + dma_sync_single_for_device(nv_device_base(device),
> + ttm_dma->dma_address[i], PAGE_SIZE,
> + DMA_TO_DEVICE);
> + }
> +}
> +
> +void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> +{
> + struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> + struct nouveau_device *device = nouveau_dev(drm->dev);
> + struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> + int i;
> +
> + if (!ttm_dma)
> + return;
> +
> + if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> + return;
> +
> + if (nv_device_is_pci(device)) {
> + for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> + pci_dma_sync_single_for_cpu(device->pdev,
> + ttm_dma->dma_address[i], PAGE_SIZE,
> + PCI_DMA_FROMDEVICE);
> + } else {
> + for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> + dma_sync_single_for_cpu(nv_device_base(device),
> + ttm_dma->dma_address[i], PAGE_SIZE,
> + DMA_FROM_DEVICE);
> + }
> +}
> +
> int
> nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
> bool no_wait_gpu)
> @@ -418,6 +472,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
> if (!ttm)
> return ret;
>
> + nouveau_bo_sync_for_device(nvbo);
> +
> device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
> nv_wr32(device, 0x70004, 0x00000001);
> if (!nv_wait(device, 0x070004, 0x00000001, 0x00000000))
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
> index ff17c1f432fc..fa42298d2dca 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.h
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
> @@ -81,6 +81,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
> void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
> int nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
> bool no_wait_gpu);
> +void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);
>
> struct nouveau_vma *
> nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index c90c0dc0afe8..08829a720891 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -896,6 +896,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
> spin_lock(&nvbo->bo.bdev->fence_lock);
> ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
> spin_unlock(&nvbo->bo.bdev->fence_lock);
> + nouveau_bo_sync_for_cpu(nvbo);
> drm_gem_object_unreference_unlocked(gem);
> return ret;
> }
> @@ -904,6 +905,17 @@ int
> nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
> struct drm_file *file_priv)
> {
> + struct drm_nouveau_gem_cpu_fini *req = data;
> + struct drm_gem_object *gem;
> + struct nouveau_bo *nvbo;
> +
> + gem = drm_gem_object_lookup(dev, file_priv, req->handle);
> + if (!gem)
> + return -ENOENT;
> + nvbo = nouveau_gem_object(gem);
> +
> + nouveau_bo_sync_for_device(nvbo);
> + drm_gem_object_unreference_unlocked(gem);
> return 0;
> }
>
> --
> 2.0.0
>

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2014-07-11 02:35:30

by Alexandre Courbot

Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

On 07/10/2014 09:58 PM, Daniel Vetter wrote:
> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>> page_to_phys() is not the correct way to obtain the DMA address of a
>> buffer on a non-PCI system. Use the DMA API functions for this, which
>> are portable and will allow us to use other DMA API functions for
>> buffer synchronization.
>>
>> Signed-off-by: Alexandre Courbot <[email protected]>
>> ---
>> drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> index 18c8c7245b73..e4e9e64988fe 100644
>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
>> if (pci_dma_mapping_error(device->pdev, ret))
>> ret = 0;
>> } else {
>> - ret = page_to_phys(page);
>> + ret = dma_map_page(&device->platformdev->dev, page, 0,
>> + PAGE_SIZE, DMA_BIDIRECTIONAL);
>> + if (dma_mapping_error(&device->platformdev->dev, ret))
>> + ret = 0;
>> }
>>
>> return ret;
>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
>> if (nv_device_is_pci(device))
>> pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>> PCI_DMA_BIDIRECTIONAL);
>
> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
> device embedded in pci_device (like for platform drivers). Dunno whether
> it's worth to track a pointer to the struct device directly and always
> call dma_unmap/map.

Isn't it (theoretically) possible to have a platform that does not use
the DMA API for its PCI implementation and thus requires the pci_*
functions to be called? I could not find such a case in -next, which
suggests that all PCI platforms have been converted to the DMA API
already and that we could indeed refactor this to always use the DMA
functions.

But at the same time the way we use APIs should not be directed by their
implementation, but by their intent - and unless the PCI API has been
deprecated in some way (something I am not aware of), the rule is still
that you should use it on a PCI device.

>
> Just drive-by comment since I'm interested in how you solve this - i915
> has similar fun with buffer sharing and coherent and non-coherent
> platforms. Although we don't have fun with pci and non-pci based
> platforms.

Yeah, I am not familiar with i915 but it seems like we are in a similar
boat here (except that ARM is more constrained as to its memory mappings).
The strategy in this series is, map buffers used by user-space cached
and explicitly synchronize them (since the ownership transition from
user to GPU is always clearly performed by syscalls), and use coherent
mappings for buffers used by the kernel which are accessed more
randomly. This has solved all our coherency issues and resulted in the
best performance so far.

2014-07-11 02:40:34

by Alexandre Courbot

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required

On 07/10/2014 10:04 PM, Daniel Vetter wrote:
> On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
>> On architectures for which access to GPU memory is non-coherent,
>> caches need to be flushed and invalidated explicitly when BO control
>> changes between CPU and GPU.
>>
>> This patch adds buffer synchronization functions which invokes the
>> correct API (PCI or DMA) to ensure synchronization is effective.
>>
>> Based on the TTM DMA cache helper patches by Lucas Stach.
>>
>> Signed-off-by: Lucas Stach <[email protected]>
>> Signed-off-by: Alexandre Courbot <[email protected]>
>> ---
>> drivers/gpu/drm/nouveau/nouveau_bo.c | 56 +++++++++++++++++++++++++++++++++++
>> drivers/gpu/drm/nouveau/nouveau_bo.h | 2 ++
>> drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>> 3 files changed, 70 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> index 67e9e8e2e2ec..47e4e8886769 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>> ttm_bo_kunmap(&nvbo->kmap);
>> }
>>
>> +void
>> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
>> +{
>> + struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
>> + struct nouveau_device *device = nouveau_dev(drm->dev);
>> + struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
>> + int i;
>> +
>> + if (!ttm_dma)
>> + return;
>> +
>> + if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
>> + return;
>
> Is the is_cpu_coherent check really required? On coherent platforms the
> sync_for_foo should be a noop. It's the dma api's job to encapsulate this
> knowledge so that drivers can be blissfully ignorant. The explicit
> is_coherent check makes this a bit leaky. And same comment that underlying
> the bus-specifics dma-mapping functions are identical.

I think you are right, the is_cpu_coherent check should not be needed
here. I still think we should have separate paths for the PCI/DMA cases
though, unless you can point me to a source that clearly states that the
PCI API is deprecated and that the DMA API should be used instead.

2014-07-11 02:50:15

by Ben Skeggs

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

On Fri, Jul 11, 2014 at 12:35 PM, Alexandre Courbot <[email protected]> wrote:
> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
>>
>> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>>>
>>> page_to_phys() is not the correct way to obtain the DMA address of a
>>> buffer on a non-PCI system. Use the DMA API functions for this, which
>>> are portable and will allow us to use other DMA API functions for
>>> buffer synchronization.
>>>
>>> Signed-off-by: Alexandre Courbot <[email protected]>
>>> ---
>>> drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> index 18c8c7245b73..e4e9e64988fe 100644
>>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device,
>>> struct page *page)
>>> if (pci_dma_mapping_error(device->pdev, ret))
>>> ret = 0;
>>> } else {
>>> - ret = page_to_phys(page);
>>> + ret = dma_map_page(&device->platformdev->dev, page, 0,
>>> + PAGE_SIZE, DMA_BIDIRECTIONAL);
>>> + if (dma_mapping_error(&device->platformdev->dev, ret))
>>> + ret = 0;
>>> }
>>>
>>> return ret;
>>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device,
>>> dma_addr_t addr)
>>> if (nv_device_is_pci(device))
>>> pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>> PCI_DMA_BIDIRECTIONAL);
>>
>>
>> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
>> device embedded in pci_device (like for platform drivers). Dunno whether
>> it's worth to track a pointer to the struct device directly and always
>> call dma_unmap/map.
>
>
> Isn't it (theoretically) possible to have a platform that does not use the
> DMA API for its PCI implementation and thus requires the pci_* functions to
> be called? I could not find such a case in -next, which suggests that all
> PCI platforms have been converted to the DMA API already and that we could
> indeed refactor this to always use the DMA functions.
>
> But at the same time the way we use APIs should not be directed by their
> implementation, but by their intent - and unless the PCI API has been
> deprecated in some way (something I am not aware of), the rule is still that
> you should use it on a PCI device.
>
>
>>
>> Just drive-by comment since I'm interested in how you solve this - i915
>> has similar fun with buffer sharing and coherent and non-coherent
>> platforms. Although we don't have fun with pci and non-pci based
>> platforms.
>
>
> Yeah, I am not familiar with i915 but it seems like we are on a similar boat
> here (excepted ARM is more constrained as to its memory mappings). The
> strategy in this series is, map buffers used by user-space cached and
> explicitly synchronize them (since the ownership transition from user to GPU
> is always clearly performed by syscalls), and use coherent mappings for
> buffers used by the kernel which are accessed more randomly. This has solved
> all our coherency issues and resulted in the best performance so far.
I wonder if we might want to use unsnooped cached mappings of pages on
non-ARM platforms also, to avoid the overhead of the cache snooping?


2014-07-11 02:57:42

by Alexandre Courbot

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

On 07/11/2014 11:50 AM, Ben Skeggs wrote:
> On Fri, Jul 11, 2014 at 12:35 PM, Alexandre Courbot <[email protected]> wrote:
>> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>>>>
>>>> page_to_phys() is not the correct way to obtain the DMA address of a
>>>> buffer on a non-PCI system. Use the DMA API functions for this, which
>>>> are portable and will allow us to use other DMA API functions for
>>>> buffer synchronization.
>>>>
>>>> Signed-off-by: Alexandre Courbot <[email protected]>
>>>> ---
>>>> drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> index 18c8c7245b73..e4e9e64988fe 100644
>>>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device,
>>>> struct page *page)
>>>> if (pci_dma_mapping_error(device->pdev, ret))
>>>> ret = 0;
>>>> } else {
>>>> - ret = page_to_phys(page);
>>>> + ret = dma_map_page(&device->platformdev->dev, page, 0,
>>>> + PAGE_SIZE, DMA_BIDIRECTIONAL);
>>>> + if (dma_mapping_error(&device->platformdev->dev, ret))
>>>> + ret = 0;
>>>> }
>>>>
>>>> return ret;
>>>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device,
>>>> dma_addr_t addr)
>>>> if (nv_device_is_pci(device))
>>>> pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>>> PCI_DMA_BIDIRECTIONAL);
>>>
>>>
>>> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
>>> device embedded in pci_device (like for platform drivers). Dunno whether
>>> it's worth to track a pointer to the struct device directly and always
>>> call dma_unmap/map.
>>
>>
>> Isn't it (theoretically) possible to have a platform that does not use the
>> DMA API for its PCI implementation and thus requires the pci_* functions to
>> be called? I could not find such a case in -next, which suggests that all
>> PCI platforms have been converted to the DMA API already and that we could
>> indeed refactor this to always use the DMA functions.
>>
>> But at the same time the way we use APIs should not be directed by their
>> implementation, but by their intent - and unless the PCI API has been
>> deprecated in some way (something I am not aware of), the rule is still that
>> you should use it on a PCI device.
>>
>>
>>>
>>> Just drive-by comment since I'm interested in how you solve this - i915
>>> has similar fun with buffer sharing and coherent and non-coherent
>>> platforms. Although we don't have fun with pci and non-pci based
>>> platforms.
>>
>>
>> Yeah, I am not familiar with i915 but it seems like we are on a similar boat
>> here (excepted ARM is more constrained as to its memory mappings). The
>> strategy in this series is, map buffers used by user-space cached and
>> explicitly synchronize them (since the ownership transition from user to GPU
>> is always clearly performed by syscalls), and use coherent mappings for
>> buffers used by the kernel which are accessed more randomly. This has solved
>> all our coherency issues and resulted in the best performance so far.
> I wonder if we might want to use unsnooped cached mappings of pages on
> non-ARM platforms also, to avoid the overhead of the cache snooping?

You might want to indeed, now that coherency is guaranteed by the sync
functions originally introduced by Lucas. The only issue I can see is
that they always invalidate the full buffer, whereas bus snooping only
affects pages that are actually touched. Someone would need to try this
on a desktop machine and see how it affects performance.

I'd be all for it though, since it would also allow us to get rid of
this ungraceful nv_device_is_cpu_coherent() function and simplify
nouveau_bo.c a bit.

2014-07-11 07:38:07

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

On Fri, Jul 11, 2014 at 11:35:23AM +0900, Alexandre Courbot wrote:
> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
> >On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
> >>page_to_phys() is not the correct way to obtain the DMA address of a
> >>buffer on a non-PCI system. Use the DMA API functions for this, which
> >>are portable and will allow us to use other DMA API functions for
> >>buffer synchronization.
> >>
> >>Signed-off-by: Alexandre Courbot <[email protected]>
> >>---
> >> drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
> >> 1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>index 18c8c7245b73..e4e9e64988fe 100644
> >>--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>@@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
> >> if (pci_dma_mapping_error(device->pdev, ret))
> >> ret = 0;
> >> } else {
> >>- ret = page_to_phys(page);
> >>+ ret = dma_map_page(&device->platformdev->dev, page, 0,
> >>+ PAGE_SIZE, DMA_BIDIRECTIONAL);
> >>+ if (dma_mapping_error(&device->platformdev->dev, ret))
> >>+ ret = 0;
> >> }
> >>
> >> return ret;
> >>@@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
> >> if (nv_device_is_pci(device))
> >> pci_unmap_page(device->pdev, addr, PAGE_SIZE,
> >> PCI_DMA_BIDIRECTIONAL);
> >
> >pci_map/unmap alias to dma_unmap/map when called on the underlying struct
> >device embedded in pci_device (like for platform drivers). Dunno whether
> >it's worth to track a pointer to the struct device directly and always
> >call dma_unmap/map.
>
> Isn't it (theoretically) possible to have a platform that does not use the
> DMA API for its PCI implementation and thus requires the pci_* functions to
> be called? I could not find such a case in -next, which suggests that all
> PCI platforms have been converted to the DMA API already and that we could
> indeed refactor this to always use the DMA functions.
>
> But at the same time the way we use APIs should not be directed by their
> implementation, but by their intent - and unless the PCI API has been
> deprecated in some way (something I am not aware of), the rule is still that
> you should use it on a PCI device.

Hm, somehow I thought it was recommended to just use the dma api
directly. It's what we're doing in i915 at least, but now I'm not so sure
any more. My guess is that this is just history really, from when the dma
api was pci-only.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2014-07-11 07:41:34

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required

On Fri, Jul 11, 2014 at 11:40:27AM +0900, Alexandre Courbot wrote:
> On 07/10/2014 10:04 PM, Daniel Vetter wrote:
> >On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
> >>On architectures for which access to GPU memory is non-coherent,
> >>caches need to be flushed and invalidated explicitly when BO control
> >>changes between CPU and GPU.
> >>
> >>This patch adds buffer synchronization functions which invokes the
> >>correct API (PCI or DMA) to ensure synchronization is effective.
> >>
> >>Based on the TTM DMA cache helper patches by Lucas Stach.
> >>
> >>Signed-off-by: Lucas Stach <[email protected]>
> >>Signed-off-by: Alexandre Courbot <[email protected]>
> >>---
> >> drivers/gpu/drm/nouveau/nouveau_bo.c | 56 +++++++++++++++++++++++++++++++++++
> >> drivers/gpu/drm/nouveau/nouveau_bo.h | 2 ++
> >> drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
> >> 3 files changed, 70 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>index 67e9e8e2e2ec..47e4e8886769 100644
> >>--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>@@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
> >> ttm_bo_kunmap(&nvbo->kmap);
> >> }
> >>
> >>+void
> >>+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
> >>+{
> >>+ struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> >>+ struct nouveau_device *device = nouveau_dev(drm->dev);
> >>+ struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> >>+ int i;
> >>+
> >>+ if (!ttm_dma)
> >>+ return;
> >>+
> >>+ if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> >>+ return;
> >
> >Is the is_cpu_coherent check really required? On coherent platforms the
> >sync_for_foo should be a noop. It's the dma api's job to encapsulate this
> >knowledge so that drivers can be blissfully ignorant. The explicit
> >is_coherent check makes this a bit leaky. And same comment that underlying
> >the bus-specifics dma-mapping functions are identical.
>
> I think you are right, the is_cpu_coherent check should not be needed here.
> I still think we should have separate paths for the PCI/DMA cases though,
> unless you can point me to a source that clearly states that the PCI API is
> deprecated and that DMA should be used instead.

Ah, on 2nd look I've found it again. Quoting
Documentation/DMA-API-HOWTO.txt:

"Note that the DMA API works with any bus independent of the underlying
microprocessor architecture. You should use the DMA API rather than the
bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
pci_map_*() interfaces."

The advice is fairly strong here I think ;-) And imo the idea makes sense,
since it allows drivers like nouveau here to care much less about the
actual bus used to get data to/from the ip block. And if you look at intel
gfx it makes even more sense since the pci layer we have is really just a
thin fake shim whacked on top of the hw (on SoCs at least).

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2014-07-11 09:35:25

by Alexandre Courbot

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required

On 07/11/2014 04:41 PM, Daniel Vetter wrote:
> On Fri, Jul 11, 2014 at 11:40:27AM +0900, Alexandre Courbot wrote:
>> On 07/10/2014 10:04 PM, Daniel Vetter wrote:
>>> On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
>>>> On architectures for which access to GPU memory is non-coherent,
>>>> caches need to be flushed and invalidated explicitly when BO control
>>>> changes between CPU and GPU.
>>>>
>>>> This patch adds buffer synchronization functions which invokes the
>>>> correct API (PCI or DMA) to ensure synchronization is effective.
>>>>
>>>> Based on the TTM DMA cache helper patches by Lucas Stach.
>>>>
>>>> Signed-off-by: Lucas Stach <[email protected]>
>>>> Signed-off-by: Alexandre Courbot <[email protected]>
>>>> ---
>>>> drivers/gpu/drm/nouveau/nouveau_bo.c | 56 +++++++++++++++++++++++++++++++++++
>>>> drivers/gpu/drm/nouveau/nouveau_bo.h | 2 ++
>>>> drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>>>> 3 files changed, 70 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> index 67e9e8e2e2ec..47e4e8886769 100644
>>>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>>>> ttm_bo_kunmap(&nvbo->kmap);
>>>> }
>>>>
>>>> +void
>>>> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
>>>> +{
>>>> + struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
>>>> + struct nouveau_device *device = nouveau_dev(drm->dev);
>>>> + struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
>>>> + int i;
>>>> +
>>>> + if (!ttm_dma)
>>>> + return;
>>>> +
>>>> + if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
>>>> + return;
>>>
>>> Is the is_cpu_coherent check really required? On coherent platforms the
>>> sync_for_foo should be a noop. It's the dma api's job to encapsulate this
>>> knowledge so that drivers can be blissfully ignorant. The explicit
>>> is_coherent check makes this a bit leaky. And same comment that underlying
>>> the bus-specifics dma-mapping functions are identical.
>>
>> I think you are right, the is_cpu_coherent check should not be needed here.
>> I still think we should have separate paths for the PCI/DMA cases though,
>> unless you can point me to a source that clearly states that the PCI API is
>> deprecated and that DMA should be used instead.
>
> Ah, on 2nd look I've found it again. Quoting
> Documentation/DMA-API-HOWTO.txt:
>
> "Note that the DMA API works with any bus independent of the underlying
> microprocessor architecture. You should use the DMA API rather than the
> bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
> pci_map_*() interfaces."
>
> The advice is fairly strong here I think ;-) And imo the idea makes sense,
> since it allows drivers like nouveau here to care much less about the
> actual bus used to get data to/from the ip block. And if you look at intel
> gfx it makes even more sense since the pci layer we have is really just a
> thin fake shim whacked on top of the hw (on SoCs at least).

Indeed, I stand corrected. :) That's good news actually, as it will
simplify the code. Thanks for pointing that out!

I will send a new revision that makes use of the DMA API exclusively and
will remove the nv_device_map/unmap() functions which are pretty useless
now.

2014-07-11 09:55:07

by Lucas Stach

[permalink] [raw]
Subject: Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices

Am Freitag, den 11.07.2014, 11:57 +0900 schrieb Alexandre Courbot:
[...]
> >> Yeah, I am not familiar with i915 but it seems like we are on a similar boat
> >> here (excepted ARM is more constrained as to its memory mappings). The
> >> strategy in this series is, map buffers used by user-space cached and
> >> explicitly synchronize them (since the ownership transition from user to GPU
> >> is always clearly performed by syscalls), and use coherent mappings for
> >> buffers used by the kernel which are accessed more randomly. This has solved
> >> all our coherency issues and resulted in the best performance so far.
> > I wonder if we might want to use unsnooped cached mappings of pages on
> > non-ARM platforms also, to avoid the overhead of the cache snooping?
>
> You might want to indeed, now that coherency is guaranteed by the sync
> functions originally introduced by Lucas. The only issue I could see is
> that they always invalidate the full buffer whereas bus snooping only
> affects pages that are actually touched. Someone would need to try this
> on a desktop machine and see how it affects performance.
>
> I'd be all for it though, since it would also allow us to get rid of
> this ungraceful nv_device_is_cpu_coherent() function and result in
> simplifying nouveau_bo.c a bit.

This will need some testing to get hard numbers, but I suspect that
invalidating the whole buffer isn't too bad, as the prefetch machinery
works very well with the access patterns we see in graphics drivers.

Flushing out the whole buffer should be even less problematic, as it
will only flush out dirty lines that would need to be flushed on GPU
read snooping anyway.

In the long run we might want a separate cpu prepare/finish ioctl where
we can indicate the area of interest. This might help to avoid some of
the invalidate overhead especially for userspace suballocated buffers.

Regards,
Lucas

--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |