2014-10-27 09:49:35

by Alexandre Courbot

Subject: [PATCH v5 0/4] drm: nouveau: memory coherency on ARM

It has been a couple of months since v4 - apologies for this. v4 has not
received many comments, but this version addresses them and makes a new
attempt at pushing the critical bit for GK20A and Nouveau on ARM in
general.

As a reminder, this series addresses the memory coherency issue that we
are seeing on ARM platforms. Contrary to x86, where caches are snooped
and invalidated whenever the CPU writes to a GPU-accessed area (and
vice-versa), such accesses on ARM may leave the other accessor with an
incoherent view of memory.

To address this, patches 1-3 add the ability to detect whether we are
on a non-coherent architecture, implement a way to explicitly allocate
coherent buffers using the DMA API, and use it for GPFIFOs and fences.
Patch 4 also uses the DMA API to synchronize user-space allocated
buffers when they are passed between the CPU and the GPU.
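
As a quick illustration, the synchronization added by patch 4
essentially boils down to looping over the pages of a DMA-API-populated
TTM (simplified extract from the patch):

    for (i = 0; i < ttm_dma->ttm.num_pages; i++)
            dma_sync_single_for_device(nv_device_base(device),
                    ttm_dma->dma_address[i], PAGE_SIZE, DMA_TO_DEVICE);

with dma_sync_single_for_cpu() and DMA_FROM_DEVICE being used for the
opposite direction, when the buffer is handed back to the CPU.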

Thanks to the feedback received on the previous revisions I believe this
code looks rather good now. I have also tested it extensively and no
longer see any buffer corruption issues. There is still one point
which is not completely satisfying in my opinion:

TTMs for TTM-backed objects are allocated in nouveau_sgdma_create_ttm()
and populated in nouveau_ttm_tt_populate(). Coherently-allocated buffers
need to use the ttm_dma API instead of the pool-based TTM API, and whether
an object is coherent or not is stored in its instance of nouveau_bo.

The problem is that neither nouveau_sgdma_create_ttm() nor
nouveau_ttm_tt_populate() have a way to access the nouveau_bo they are
working for. This is in particular a problem for nouveau_ttm_tt_populate()
since we need to rely on a purely TTM-based heuristic to decide how to
allocate the memory. The heuristic we are using works, but it makes the
code harder to understand than if we could just access the nouveau_bo.
nouveau_sgdma_create_ttm() always allocates a ttm_dma_tt structure,
which is wrong but happens to suit us for now. Still, this part of the
code could be rewritten much more cleanly if only we could access the
nouveau_bo instance in these functions.
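
For reference, the heuristic currently used in nouveau_ttm_tt_populate()
(see patch 2) is simply:

    if (!nv_device_is_cpu_coherent(device) &&
        ttm->caching_state == tt_uncached)
            return ttm_dma_populate(ttm_dma, dev->dev);

i.e. we infer that a TTM belongs to a force_coherent BO from its caching
state alone, rather than reading nvbo->force_coherent directly.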

I proposed some time ago to address this by making the ttm_tt_create
hook take a pointer to a ttm_buffer_object instead of a ttm_bo_device.
This would still allow us to access the ttm_bo_device, while letting
us retrieve the nouveau_bo and store it into whatever structure we
embed our TTM into. For some reason David was not fond of the idea - I
am taking another chance at submitting it since the issue is still
not resolved and leads to inferior-looking code in at least Nouveau.
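
To make the idea concrete, the proposed change would look roughly like
this (sketch only, other arguments left as they currently are):

    /* current hook */
    struct ttm_tt *(*ttm_tt_create)(struct ttm_bo_device *bdev,
                                    unsigned long size, uint32_t page_flags,
                                    struct page *dummy_read_page);

    /* proposed hook */
    struct ttm_tt *(*ttm_tt_create)(struct ttm_buffer_object *bo,
                                    unsigned long size, uint32_t page_flags,
                                    struct page *dummy_read_page);

The bo_device would remain reachable through bo->bdev, so existing
drivers should only need a mechanical update.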

Phew, sorry for the long cover letter - thanks if you have read until
here! :)

Changes since v4:
- Only use DMA API for sync, as suggested by Daniel

Alexandre Courbot (4):
drm: introduce nv_device_is_cpu_coherent()
drm: implement explicitly coherent BOs
drm: allocate GPFIFOs and fences coherently
drm: synchronize BOs when required

drm/nouveau_bo.c | 122 ++++++++++++++++++++++++++++++++++++++++++---
drm/nouveau_bo.h | 3 ++
drm/nouveau_chan.c | 2 +-
drm/nouveau_gem.c | 12 +++++
drm/nv84_fence.c | 4 +-
lib/core/os.h | 2 +
nvkm/include/core/device.h | 6 +++
7 files changed, 140 insertions(+), 11 deletions(-)

--
2.1.2


2014-10-27 09:49:40

by Alexandre Courbot

Subject: [PATCH v5 1/4] drm: introduce nv_device_is_cpu_coherent()

Add a function allowing us to know whether a device is CPU-coherent,
i.e. whether accesses performed by the CPU on GPU-mapped buffers are
immediately visible on the GPU side, and vice-versa.

For now, a device is considered to be coherent if it uses the PCI bus on
a non-ARM architecture.

Signed-off-by: Alexandre Courbot <[email protected]>
---
lib/core/os.h | 2 ++
nvkm/include/core/device.h | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/lib/core/os.h b/lib/core/os.h
index fba9542292ac..79462eb2cfd4 100644
--- a/lib/core/os.h
+++ b/lib/core/os.h
@@ -101,6 +101,8 @@ typedef dma_addr_t resource_size_t;
#define __printf(a,b)
#define __user

+#define IS_ENABLED(x) (0)
+
static inline int
order_base_2(u64 base)
{
diff --git a/nvkm/include/core/device.h b/nvkm/include/core/device.h
index 1d9d893929bb..0d839e1ddaf4 100644
--- a/nvkm/include/core/device.h
+++ b/nvkm/include/core/device.h
@@ -158,6 +158,12 @@ nv_device_is_pci(struct nouveau_device *device)
return device->pdev != NULL;
}

+static inline bool
+nv_device_is_cpu_coherent(struct nouveau_device *device)
+{
+ return (!IS_ENABLED(CONFIG_ARM) && nv_device_is_pci(device));
+}
+
static inline struct device *
nv_device_base(struct nouveau_device *device)
{
--
2.1.2

2014-10-27 09:49:43

by Alexandre Courbot

Subject: [PATCH v5 3/4] drm: allocate GPFIFOs and fences coherently

Specify TTM_PL_FLAG_UNCACHED when allocating GPFIFOs and fences to
allow them to be safely accessed by the kernel without being synced
on non-coherent architectures.

Signed-off-by: Alexandre Courbot <[email protected]>
---
drm/nouveau_chan.c | 2 +-
drm/nv84_fence.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drm/nouveau_chan.c b/drm/nouveau_chan.c
index 77c81d6b45ee..0f3da86840f2 100644
--- a/drm/nouveau_chan.c
+++ b/drm/nouveau_chan.c
@@ -102,7 +102,7 @@ nouveau_channel_prep(struct nouveau_drm *drm, struct nvif_device *device,
chan->drm = drm;

/* allocate memory for dma push buffer */
- target = TTM_PL_FLAG_TT;
+ target = TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED;
if (nouveau_vram_pushbuf)
target = TTM_PL_FLAG_VRAM;

diff --git a/drm/nv84_fence.c b/drm/nv84_fence.c
index d6c6c87c3f07..4d79be7558d8 100644
--- a/drm/nv84_fence.c
+++ b/drm/nv84_fence.c
@@ -246,8 +246,8 @@ nv84_fence_create(struct nouveau_drm *drm)

if (ret == 0)
ret = nouveau_bo_new(drm->dev, 16 * priv->base.contexts, 0,
- TTM_PL_FLAG_TT, 0, 0, NULL, NULL,
- &priv->bo_gart);
+ TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED, 0,
+ 0, NULL, NULL, &priv->bo_gart);
if (ret == 0) {
ret = nouveau_bo_pin(priv->bo_gart, TTM_PL_FLAG_TT);
if (ret == 0) {
--
2.1.2

2014-10-27 09:49:48

by Alexandre Courbot

Subject: [PATCH v5 4/4] drm: synchronize BOs when required

On architectures for which access to GPU memory is non-coherent,
caches need to be flushed and invalidated explicitly when BO control
changes between CPU and GPU.

This patch adds buffer synchronization functions which invoke the
correct API (PCI or DMA) to ensure synchronization is effective.

Based on the TTM DMA cache helper patches by Lucas Stach.

Signed-off-by: Lucas Stach <[email protected]>
Signed-off-by: Alexandre Courbot <[email protected]>
---
drm/nouveau_bo.c | 42 ++++++++++++++++++++++++++++++++++++++++++
drm/nouveau_bo.h | 2 ++
drm/nouveau_gem.c | 12 ++++++++++++
3 files changed, 56 insertions(+)

diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c
index ed9a6946f6d6..d2a4768e3efd 100644
--- a/drm/nouveau_bo.c
+++ b/drm/nouveau_bo.c
@@ -426,6 +426,46 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
ttm_bo_kunmap(&nvbo->kmap);
}

+void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+ struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+ struct nouveau_device *device = nvkm_device(&drm->device);
+ struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ int i;
+
+ if (!ttm_dma)
+ return;
+
+ /* Don't waste time looping if the object is coherent */
+ if (nvbo->force_coherent)
+ return;
+
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ dma_sync_single_for_device(nv_device_base(device),
+ ttm_dma->dma_address[i], PAGE_SIZE, DMA_TO_DEVICE);
+}
+
+void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+ struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+ struct nouveau_device *device = nvkm_device(&drm->device);
+ struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ int i;
+
+ if (!ttm_dma)
+ return;
+
+ /* Don't waste time looping if the object is coherent */
+ if (nvbo->force_coherent)
+ return;
+
+ for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+ dma_sync_single_for_cpu(nv_device_base(device),
+ ttm_dma->dma_address[i], PAGE_SIZE, DMA_FROM_DEVICE);
+}
+
int
nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
bool no_wait_gpu)
@@ -437,6 +477,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
if (ret)
return ret;

+ nouveau_bo_sync_for_device(nvbo);
+
return 0;
}

diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h
index 0f8bbd48a0b9..c827f233e41d 100644
--- a/drm/nouveau_bo.h
+++ b/drm/nouveau_bo.h
@@ -85,6 +85,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *, bool exclusive);
int nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
bool no_wait_gpu);
+void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
+void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);

struct nouveau_vma *
nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drm/nouveau_gem.c b/drm/nouveau_gem.c
index 36951ee4b157..42c34babc2e5 100644
--- a/drm/nouveau_gem.c
+++ b/drm/nouveau_gem.c
@@ -867,6 +867,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
else
ret = lret;
}
+ nouveau_bo_sync_for_cpu(nvbo);
drm_gem_object_unreference_unlocked(gem);

return ret;
@@ -876,6 +877,17 @@ int
nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
struct drm_file *file_priv)
{
+ struct drm_nouveau_gem_cpu_fini *req = data;
+ struct drm_gem_object *gem;
+ struct nouveau_bo *nvbo;
+
+ gem = drm_gem_object_lookup(dev, file_priv, req->handle);
+ if (!gem)
+ return -ENOENT;
+ nvbo = nouveau_gem_object(gem);
+
+ nouveau_bo_sync_for_device(nvbo);
+ drm_gem_object_unreference_unlocked(gem);
return 0;
}

--
2.1.2

2014-10-27 09:50:39

by Alexandre Courbot

Subject: [PATCH v5 2/4] drm: implement explicitly coherent BOs

Allow nouveau_bo_new() to recognize the TTM_PL_FLAG_UNCACHED flag, which
means that we want the allocated BO to be perfectly coherent between the
CPU and GPU. This is useful on non-coherent architectures for
rarely-accessed buffers that we do not want to sync manually: typically,
fences and pushbuffers.

A TTM BO allocated with TTM_PL_FLAG_UNCACHED on a non-coherent
architecture will be populated using the DMA API, and accesses to it
will go through the coherent mapping provided by dma_alloc_coherent().

Signed-off-by: Alexandre Courbot <[email protected]>
---
drm/nouveau_bo.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++------
drm/nouveau_bo.h | 1 +
2 files changed, 73 insertions(+), 8 deletions(-)

diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c
index 9a8adeec80cd..ed9a6946f6d6 100644
--- a/drm/nouveau_bo.c
+++ b/drm/nouveau_bo.c
@@ -214,6 +214,9 @@ nouveau_bo_new(struct drm_device *dev, int size, int align,
nvbo->tile_flags = tile_flags;
nvbo->bo.bdev = &drm->ttm.bdev;

+ if (!nv_device_is_cpu_coherent(nvkm_device(&drm->device)))
+ nvbo->force_coherent = flags & TTM_PL_FLAG_UNCACHED;
+
nvbo->page_shift = 12;
if (drm->client.vm) {
if (!(flags & TTM_PL_FLAG_TT) && size > 256 * 1024)
@@ -291,8 +294,9 @@ void
nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
{
struct ttm_placement *pl = &nvbo->placement;
- uint32_t flags = TTM_PL_MASK_CACHING |
- (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
+ uint32_t flags = (nvbo->force_coherent ? TTM_PL_FLAG_UNCACHED :
+ TTM_PL_MASK_CACHING) |
+ (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);

pl->placement = nvbo->placements;
set_placement_list(nvbo->placements, &pl->num_placement,
@@ -396,7 +400,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
if (ret)
return ret;

- ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages, &nvbo->kmap);
+ /*
+ * TTM buffers allocated using the DMA API already have a mapping, let's
+ * use it instead.
+ */
+ if (!nvbo->force_coherent)
+ ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages,
+ &nvbo->kmap);
+
ttm_bo_unreserve(&nvbo->bo);
return ret;
}
@@ -404,7 +415,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
void
nouveau_bo_unmap(struct nouveau_bo *nvbo)
{
- if (nvbo)
+ if (!nvbo)
+ return;
+
+ /*
+ * TTM buffers allocated using the DMA API already had a coherent
+ * mapping which we used, no need to unmap.
+ */
+ if (!nvbo->force_coherent)
ttm_bo_kunmap(&nvbo->kmap);
}

@@ -422,12 +440,36 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
return 0;
}

+static inline void *
+_nouveau_bo_mem_index(struct nouveau_bo *nvbo, unsigned index, void *mem, u8 sz)
+{
+ struct ttm_dma_tt *dma_tt;
+ u8 *m = mem;
+
+ index *= sz;
+
+ if (m) {
+ /* kmap'd address, return the corresponding offset */
+ m += index;
+ } else {
+ /* DMA-API mapping, lookup the right address */
+ dma_tt = (struct ttm_dma_tt *)nvbo->bo.ttm;
+ m = dma_tt->cpu_address[index / PAGE_SIZE];
+ m += index % PAGE_SIZE;
+ }
+
+ return m;
+}
+#define nouveau_bo_mem_index(o, i, m) _nouveau_bo_mem_index(o, i, m, sizeof(*m))
+
u16
nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
{
bool is_iomem;
u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
return ioread16_native((void __force __iomem *)mem);
else
@@ -439,7 +481,9 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
{
bool is_iomem;
u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
iowrite16_native(val, (void __force __iomem *)mem);
else
@@ -451,7 +495,9 @@ nouveau_bo_rd32(struct nouveau_bo *nvbo, unsigned index)
{
bool is_iomem;
u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
return ioread32_native((void __force __iomem *)mem);
else
@@ -463,7 +509,9 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
{
bool is_iomem;
u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
- mem = &mem[index];
+
+ mem = nouveau_bo_mem_index(nvbo, index, mem);
+
if (is_iomem)
iowrite32_native(val, (void __force __iomem *)mem);
else
@@ -1383,6 +1431,14 @@ nouveau_ttm_tt_populate(struct ttm_tt *ttm)
dev = drm->dev;
pdev = nv_device_base(device);

+ /*
+ * Objects matching this condition have been marked as force_coherent,
+ * so use the DMA API for them.
+ */
+ if (!nv_device_is_cpu_coherent(device) &&
+ ttm->caching_state == tt_uncached)
+ return ttm_dma_populate(ttm_dma, dev->dev);
+
#if __OS_HAS_AGP
if (drm->agp.stat == ENABLED) {
return ttm_agp_tt_populate(ttm);
@@ -1440,6 +1496,14 @@ nouveau_ttm_tt_unpopulate(struct ttm_tt *ttm)
dev = drm->dev;
pdev = nv_device_base(device);

+ /*
+ * Objects matching this condition have been marked as force_coherent,
+ * so use the DMA API for them.
+ */
+ if (!nv_device_is_cpu_coherent(device) &&
+ ttm->caching_state == tt_uncached)
+ ttm_dma_unpopulate(ttm_dma, dev->dev);
+
#if __OS_HAS_AGP
if (drm->agp.stat == ENABLED) {
ttm_agp_tt_unpopulate(ttm);
diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h
index 22d2c764d80b..0f8bbd48a0b9 100644
--- a/drm/nouveau_bo.h
+++ b/drm/nouveau_bo.h
@@ -13,6 +13,7 @@ struct nouveau_bo {
u32 valid_domains;
struct ttm_place placements[3];
struct ttm_place busy_placements[3];
+ bool force_coherent;
struct ttm_bo_kmap_obj kmap;
struct list_head head;

--
2.1.2