2021-07-05 13:04:52

by Oded Gabbay

Subject: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

Hi,
I'm sending v4 of this patch-set following the long email thread.
I want to thank Jason for reviewing v3 and pointing out the errors, saving
us time later to debug it :)

I consulted with Christian on how to fix patch 2 (the implementation) and
at the end of the day I shamelessly copied the relevant content from
amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.

I also made a few improvements after looking at the relevant code in amdgpu.
The details are in the changelog of patch 2.

I took the time to write import code for the driver, allowing me to
check real P2P with two Gaudi devices, one as exporter and the other as
importer. I'm not going to include the import code in the product; it
was just for testing purposes (although I can share it if anyone wants).

I ran it in a bare-metal environment with the IOMMU enabled, on a Skylake
CPU with a white-listed PCIe bridge (to make pci_p2pdma_distance_many()
happy).

Greg, I hope this will be good enough for you to merge this code.

Thanks,
Oded

Oded Gabbay (1):
habanalabs: define uAPI to export FD for DMA-BUF

Tomer Tayar (1):
habanalabs: add support for dma-buf exporter

drivers/misc/habanalabs/Kconfig | 1 +
drivers/misc/habanalabs/common/habanalabs.h | 26 ++
drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
drivers/misc/habanalabs/goya/goya.c | 1 +
include/uapi/misc/habanalabs.h | 28 +-
6 files changed, 532 insertions(+), 5 deletions(-)

--
2.25.1


2021-07-05 13:04:59

by Oded Gabbay

Subject: [PATCH v4 1/2] habanalabs: define uAPI to export FD for DMA-BUF

A user process might want to share device memory with another
driver/device and allow it to access that memory over PCIe (P2P).

To enable this, we utilize the dma-buf mechanism and add dma-buf
exporter support, so the other driver can import the device memory and
access it.

The device memory is allocated using our existing allocation uAPI,
where the user will get a handle that represents the allocation.

The user will then need to call the new uAPI opcode
(HL_MEM_OP_EXPORT_DMABUF_FD) and pass the handle as a parameter.

The driver will return an FD that represents the DMA-BUF object that
was created to match that allocation.
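
For example, userspace would do something along these lines (a rough
illustrative sketch only; export_device_mem() is just an example helper
name, HL_IOCTL_MEMORY and the <misc/habanalabs.h> include refer to the
driver's existing uAPI plumbing, and error handling is stripped down):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <misc/habanalabs.h>

/*
 * 'hlfd' is an open /dev/hlN file descriptor and 'handle' was returned
 * earlier by HL_MEM_OP_ALLOC. On Gaudi, 'handle' instead carries the
 * device memory physical address and export_dmabuf_fd.mem_size must be
 * set as well.
 */
static int export_device_mem(int hlfd, __u64 handle)
{
        union hl_mem_args args;

        memset(&args, 0, sizeof(args));
        args.in.op = HL_MEM_OP_EXPORT_DMABUF_FD;
        args.in.flags = O_RDWR | O_CLOEXEC;     /* DMA-BUF file/FD flags */
        args.in.export_dmabuf_fd.handle = handle;

        if (ioctl(hlfd, HL_IOCTL_MEMORY, &args))
                return -1;

        return (int)args.out.fd;        /* FD of the new DMA-BUF object */
}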

Signed-off-by: Oded Gabbay <[email protected]>
Reviewed-by: Tomer Tayar <[email protected]>
---
include/uapi/misc/habanalabs.h | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index 18765eb75b65..c5cbd60696d7 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -808,6 +808,10 @@ union hl_wait_cs_args {
#define HL_MEM_OP_UNMAP 3
/* Opcode to map a hw block */
#define HL_MEM_OP_MAP_BLOCK 4
+/* Opcode to create DMA-BUF object for an existing device memory allocation
+ * and to export an FD of that DMA-BUF back to the caller
+ */
+#define HL_MEM_OP_EXPORT_DMABUF_FD 5

/* Memory flags */
#define HL_MEM_CONTIGUOUS 0x1
@@ -879,11 +883,26 @@ struct hl_mem_in {
/* Virtual address returned from HL_MEM_OP_MAP */
__u64 device_virt_addr;
} unmap;
+
+ /* HL_MEM_OP_EXPORT_DMABUF_FD */
+ struct {
+ /* Handle returned from HL_MEM_OP_ALLOC. In Gaudi,
+ * where we don't have MMU for the device memory, the
+ * driver expects a physical address (instead of
+ * a handle) in the device memory space.
+ */
+ __u64 handle;
+ /* Size of memory allocation. Relevant only for GAUDI */
+ __u64 mem_size;
+ } export_dmabuf_fd;
};

/* HL_MEM_OP_* */
__u32 op;
- /* HL_MEM_* flags */
+ /* HL_MEM_* flags.
+ * For the HL_MEM_OP_EXPORT_DMABUF_FD opcode, this field holds the
+ * DMA-BUF file/FD flags.
+ */
__u32 flags;
/* Context ID - Currently not in use */
__u32 ctx_id;
@@ -920,6 +939,13 @@ struct hl_mem_out {

__u32 pad;
};
+
+ /* Returned in HL_MEM_OP_EXPORT_DMABUF_FD. Represents the
+ * DMA-BUF object that was created to describe a memory
+ * allocation on the device's memory space. The FD should be
+ * passed to the importer driver
+ */
+ __u64 fd;
};
};

--
2.25.1

2021-07-05 13:05:03

by Oded Gabbay

Subject: [PATCH v4 2/2] habanalabs: add support for dma-buf exporter

From: Tomer Tayar <[email protected]>

Implement the calls to the dma-buf kernel API to create a dma-buf
object that is exported to the user via an FD.

We block the option to mmap the DMA-BUF object because we don't support
DIRECT_IO and implicit P2P. We only implement support for explicit P2P
through importing the FD of the DMA-BUF.

In the export phase, we provide a static SG list to the DMA-BUF object
because in Habana Labs' ASICs the device memory is pinned and
immutable. Therefore, there is no need for dynamic mappings and pinning
callbacks.

Note that in GAUDI we don't have an MMU towards the device memory and
the user works with physical addresses. Therefore, the user doesn't go
through the kernel driver to allocate memory there. As a result, for
GAUDI only, we receive from the user a device memory physical address
(instead of a handle) and a size.

To get the DMA address of the PCI BAR, we use the dma_map_resource()
kernel API, because our device memory is not backed by struct page and
this API doesn't need a struct page to map the physical address to a
DMA address.

We check the P2P distance using pci_p2pdma_distance_many() and refuse
to map the dmabuf if the distance doesn't allow P2P.

Signed-off-by: Tomer Tayar <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>
---
Changes in v4:
- Use dma_map_resource() to map the physical address of the PCI BAR
to a DMA'able address that the other device can use

- Check the p2p distance using pci_p2pdma_distance_many() and refuse
to map the dmabuf if the distance doesn't allow p2p

- Move the creation of the SGT to the dmabuf map callback so the PCI
device of the importer can be used in the call to dma_map_resource()

- Move the validity check of whether the device address is exposed on
the BAR to the export phase

- Allocate an sgt per map call instead of using a single one for the
dmabuf

- Move the page-splitting code so it is used only when exporting from
an address

- Verify this works between two Gaudi devices. For that, I implemented
simple import code that imports the FD of the DMA-BUF into Gaudi's
MMU. This code is not upstreamed, as it isn't functionality I would
like to support, but it was done to verify that the export works
correctly in a real p2p scenario.

drivers/misc/habanalabs/Kconfig | 1 +
drivers/misc/habanalabs/common/habanalabs.h | 26 ++
drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
drivers/misc/habanalabs/goya/goya.c | 1 +
5 files changed, 505 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
index 293d79811372..c82d2e7b2035 100644
--- a/drivers/misc/habanalabs/Kconfig
+++ b/drivers/misc/habanalabs/Kconfig
@@ -8,6 +8,7 @@ config HABANA_AI
depends on PCI && HAS_IOMEM
select GENERIC_ALLOCATOR
select HWMON
+ select DMA_SHARED_BUFFER
help
Enables PCIe card driver for Habana's AI Processors (AIP) that are
designed to accelerate Deep Learning inference and training workloads.
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 9aedea471ebe..f2605030286e 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -26,6 +26,7 @@
#include <linux/sched/signal.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/coresight.h>
+#include <linux/dma-buf.h>

#define HL_NAME "habanalabs"

@@ -1326,6 +1327,27 @@ struct hl_pending_cb {
u32 hw_queue_id;
};

+/**
+ * struct hl_dmabuf_wrapper - a dma-buf wrapper object.
+ * @dmabuf: pointer to dma-buf object.
+ * @ctx: pointer to the dma-buf owner's context.
+ * @phys_pg_pack: pointer to physical page pack if the dma-buf was exported for
+ * memory allocation handle.
+ * @pages: array of pages that represents the physical addresses in the device
+ * memory. This is relevant in case phys_pg_pack is NULL (For Gaudi).
+ * If phys_pg_pack is valid, we take the pages array from there.
+ * @npages: number of entries in pages array, relevant if phys_pg_pack is NULL
+ * @page_size: size of page in pages array, relevant if phys_pg_pack is NULL
+ */
+struct hl_dmabuf_wrapper {
+ struct dma_buf *dmabuf;
+ struct hl_ctx *ctx;
+ struct hl_vm_phys_pg_pack *phys_pg_pack;
+ u64 *pages;
+ u64 npages;
+ u32 page_size;
+};
+
/**
* struct hl_ctx - user/kernel context.
* @mem_hash: holds mapping from virtual address to virtual memory area
@@ -1634,6 +1656,7 @@ struct hl_vm_hw_block_list_node {
* @npages: num physical pages in the pack.
* @total_size: total size of all the pages in this list.
* @mapping_cnt: number of shared mappings.
+ * @exporting_cnt: number of dma-buf exports of this pack.
* @asid: the context related to this list.
* @page_size: size of each page in the pack.
* @flags: HL_MEM_* flags related to this list.
@@ -1648,6 +1671,7 @@ struct hl_vm_phys_pg_pack {
u64 npages;
u64 total_size;
atomic_t mapping_cnt;
+ u32 exporting_cnt;
u32 asid;
u32 page_size;
u32 flags;
@@ -2311,6 +2335,7 @@ struct hl_mmu_funcs {
* the error will be ignored by the driver during
* device initialization. Mainly used to debug and
* workaround firmware bugs
+ * @dram_pci_bar_start: start bus address of PCIe bar towards DRAM.
* @last_successful_open_jif: timestamp (jiffies) of the last successful
* device open.
* @last_open_session_duration_jif: duration (jiffies) of the last device open
@@ -2448,6 +2473,7 @@ struct hl_device {
u64 max_power;
u64 clock_gating_mask;
u64 boot_error_status_mask;
+ u64 dram_pci_bar_start;
u64 last_successful_open_jif;
u64 last_open_session_duration_jif;
u64 open_counter;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index a05d98db4857..68e27a853812 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0

/*
- * Copyright 2016-2019 HabanaLabs, Ltd.
+ * Copyright 2016-2021 HabanaLabs, Ltd.
* All Rights Reserved.
*/

@@ -11,11 +11,19 @@

#include <linux/uaccess.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>

#define HL_MMU_DEBUG 0

/* use small pages for supporting non-pow2 (32M/40M/48M) DRAM phys page sizes */
-#define DRAM_POOL_PAGE_SIZE SZ_8M
+#define DRAM_POOL_PAGE_SIZE SZ_8M
+
+/* dma-buf alignment requirements when exporting memory with address/size */
+#define DMA_BUF_MEM_ADDR_ALIGNMENT SZ_32M
+#define DMA_BUF_MEM_SIZE_ALIGNMENT SZ_32M
+
+/* dma-buf chunk size cannot exceed the scatterlist "unsigned int" length */
+#define DMA_BUF_CHUNK_MAX_SIZE SZ_512M

/*
* The va ranges in context object contain a list with the available chunks of
@@ -347,6 +355,13 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
return -EINVAL;
}

+ if (phys_pg_pack->exporting_cnt) {
+ dev_err(hdev->dev,
+ "handle %u is exported, cannot free\n", handle);
+ spin_unlock(&vm->idr_lock);
+ return -EINVAL;
+ }
+
/*
* must remove from idr before the freeing of the physical
* pages as the refcount of the pool is also the trigger of the
@@ -1504,13 +1519,444 @@ int hl_hw_block_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
return 0;
}

+static int alloc_sgt_from_device_pages(struct hl_device *hdev,
+ struct sg_table **sgt, u64 *pages,
+ u64 npages, u64 page_size,
+ struct device *dev,
+ enum dma_data_direction dir)
+{
+ struct asic_fixed_properties *prop;
+ int rc, i, j, nents, cur_page;
+ u64 chunk_size, bar_address;
+ struct scatterlist *sg;
+ dma_addr_t addr;
+
+ prop = &hdev->asic_prop;
+
+ *sgt = kzalloc(sizeof(**sgt), GFP_KERNEL);
+ if (!*sgt)
+ return -ENOMEM;
+
+ /* Get number of non-contiguous chunks */
+ for (i = 1, nents = 1, chunk_size = page_size ; i < npages ; i++) {
+ if (pages[i - 1] + page_size != pages[i] ||
+ chunk_size + page_size >
+ DMA_BUF_CHUNK_MAX_SIZE) {
+ nents++;
+ chunk_size = page_size;
+ continue;
+ }
+
+ chunk_size += page_size;
+ }
+
+ rc = sg_alloc_table(*sgt, nents, GFP_KERNEL | __GFP_ZERO);
+ if (rc)
+ goto error_free;
+
+ /* Merge pages and put them into the scatterlist */
+ cur_page = 0;
+ for_each_sgtable_sg((*sgt), sg, i) {
+ chunk_size = page_size;
+ for (j = cur_page + 1 ; j < npages ; j++) {
+ if (pages[j - 1] + page_size != pages[j] ||
+ chunk_size + page_size >
+ DMA_BUF_CHUNK_MAX_SIZE)
+ break;
+ chunk_size += page_size;
+ }
+
+ bar_address = hdev->dram_pci_bar_start +
+ (pages[cur_page] - prop->dram_base_address);
+
+ addr = dma_map_resource(dev, bar_address, chunk_size, dir,
+ DMA_ATTR_SKIP_CPU_SYNC);
+ rc = dma_mapping_error(dev, addr);
+ if (rc)
+ goto error_unmap;
+
+ sg_set_page(sg, NULL, chunk_size, 0);
+ sg_dma_address(sg) = addr;
+ sg_dma_len(sg) = chunk_size;
+
+ cur_page = j;
+ }
+
+ return 0;
+
+error_unmap:
+ for_each_sgtable_sg((*sgt), sg, i) {
+ if (!sg_dma_len(sg))
+ continue;
+
+ dma_unmap_resource(dev, sg_dma_address(sg),
+ sg_dma_len(sg), dir,
+ DMA_ATTR_SKIP_CPU_SYNC);
+ }
+
+ sg_free_table(*sgt);
+
+error_free:
+ kfree(*sgt);
+ return rc;
+}
+
+static int hl_dmabuf_attach(struct dma_buf *dmabuf,
+ struct dma_buf_attachment *attachment)
+{
+ struct hl_dmabuf_wrapper *hl_dmabuf;
+ struct hl_device *hdev;
+ int rc;
+
+ hl_dmabuf = dmabuf->priv;
+ hdev = hl_dmabuf->ctx->hdev;
+
+ rc = pci_p2pdma_distance_many(hdev->pdev, &attachment->dev, 1, true);
+
+ if (rc < 0)
+ attachment->peer2peer = false;
+
+ return 0;
+}
+
+static struct sg_table *hl_map_dmabuf(struct dma_buf_attachment *attachment,
+ enum dma_data_direction dir)
+{
+ struct dma_buf *dma_buf = attachment->dmabuf;
+ struct hl_vm_phys_pg_pack *phys_pg_pack;
+ struct hl_dmabuf_wrapper *hl_dmabuf;
+ struct hl_device *hdev;
+ struct sg_table *sgt;
+ int rc;
+
+ hl_dmabuf = dma_buf->priv;
+ hdev = hl_dmabuf->ctx->hdev;
+ phys_pg_pack = hl_dmabuf->phys_pg_pack;
+
+ if (!attachment->peer2peer) {
+ dev_err(hdev->dev,
+ "Failed to map dmabuf because p2p is disabled\n");
+ return ERR_PTR(-EPERM);
+ }
+
+ if (phys_pg_pack)
+ rc = alloc_sgt_from_device_pages(hdev, &sgt,
+ phys_pg_pack->pages,
+ phys_pg_pack->npages,
+ phys_pg_pack->page_size,
+ attachment->dev,
+ dir);
+ else
+ rc = alloc_sgt_from_device_pages(hdev, &sgt,
+ hl_dmabuf->pages,
+ hl_dmabuf->npages,
+ hl_dmabuf->page_size,
+ attachment->dev,
+ dir);
+
+ if (rc) {
+ dev_err(hdev->dev,
+ "failed (%d) to initialize sgt for dmabuf\n",
+ rc);
+ return ERR_PTR(rc);
+ }
+
+ return sgt;
+}
+
+static void hl_unmap_dmabuf(struct dma_buf_attachment *attachment,
+ struct sg_table *sgt,
+ enum dma_data_direction dir)
+{
+ struct scatterlist *sg;
+ int i;
+
+ for_each_sgtable_sg(sgt, sg, i)
+ dma_unmap_resource(attachment->dev, sg_dma_address(sg),
+ sg_dma_len(sg), dir,
+ DMA_ATTR_SKIP_CPU_SYNC);
+ sg_free_table(sgt);
+ kfree(sgt);
+}
+
+static void hl_release_dmabuf(struct dma_buf *dmabuf)
+{
+ struct hl_dmabuf_wrapper *hl_dmabuf = dmabuf->priv;
+ struct hl_ctx *ctx = hl_dmabuf->ctx;
+ struct hl_device *hdev = ctx->hdev;
+ struct hl_vm *vm = &hdev->vm;
+
+ if (hl_dmabuf->phys_pg_pack) {
+ spin_lock(&vm->idr_lock);
+ hl_dmabuf->phys_pg_pack->exporting_cnt--;
+ spin_unlock(&vm->idr_lock);
+ }
+
+ hl_ctx_put(hl_dmabuf->ctx);
+
+ kfree(hl_dmabuf->pages);
+ kfree(hl_dmabuf);
+}
+
+static const struct dma_buf_ops habanalabs_dmabuf_ops = {
+ .attach = hl_dmabuf_attach,
+ .map_dma_buf = hl_map_dmabuf,
+ .unmap_dma_buf = hl_unmap_dmabuf,
+ .release = hl_release_dmabuf,
+};
+
+static int export_dmabuf_common(struct hl_ctx *ctx,
+ struct hl_dmabuf_wrapper *hl_dmabuf,
+ u64 total_size, int flags, int *dmabuf_fd)
+{
+ DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+ struct hl_device *hdev = ctx->hdev;
+ int rc, fd;
+
+ exp_info.ops = &habanalabs_dmabuf_ops;
+ exp_info.size = total_size;
+ exp_info.flags = flags;
+ exp_info.priv = hl_dmabuf;
+
+ hl_dmabuf->dmabuf = dma_buf_export(&exp_info);
+ if (IS_ERR(hl_dmabuf->dmabuf)) {
+ dev_err(hdev->dev, "failed to export dma-buf\n");
+ return PTR_ERR(hl_dmabuf->dmabuf);
+ }
+
+ fd = dma_buf_fd(hl_dmabuf->dmabuf, flags);
+ if (fd < 0) {
+ dev_err(hdev->dev,
+ "failed to get a file descriptor for a dma-buf\n");
+ rc = fd;
+ goto err_dma_buf_put;
+ }
+
+ hl_dmabuf->ctx = ctx;
+ hl_ctx_get(hdev, hl_dmabuf->ctx);
+
+ *dmabuf_fd = fd;
+
+ return 0;
+
+err_dma_buf_put:
+ dma_buf_put(hl_dmabuf->dmabuf);
+ return rc;
+}
+
+/**
+ * export_dmabuf_from_addr() - export a dma-buf object for the given memory
+ * address and size.
+ * @ctx: pointer to the context structure.
+ * @device_addr: device memory physical address.
+ * @size: size of device memory.
+ * @flags: DMA-BUF file/FD flags.
+ * @dmabuf_fd: pointer to result FD that represents the dma-buf object.
+ *
+ * Create and export a dma-buf object for an existing memory allocation inside
+ * the device memory, and return a FD which is associated with the dma-buf
+ * object.
+ *
+ * Return: 0 on success, non-zero for failure.
+ */
+static int export_dmabuf_from_addr(struct hl_ctx *ctx, u64 device_addr,
+ u64 size, int flags, int *dmabuf_fd)
+{
+ struct hl_dmabuf_wrapper *hl_dmabuf;
+ struct hl_device *hdev = ctx->hdev;
+ struct asic_fixed_properties *prop;
+ u64 bar_address;
+ int rc, i;
+
+ prop = &hdev->asic_prop;
+
+ if (!IS_ALIGNED(device_addr, DMA_BUF_MEM_ADDR_ALIGNMENT)) {
+ dev_err_ratelimited(hdev->dev,
+ "address of exported device memory should be aligned to 0x%x, address 0x%llx\n",
+ DMA_BUF_MEM_ADDR_ALIGNMENT, device_addr);
+ return -EINVAL;
+ }
+
+ if (!size) {
+ dev_err_ratelimited(hdev->dev,
+ "size of exported device memory should be greater than 0\n");
+ return -EINVAL;
+ }
+
+ if (!IS_ALIGNED(size, DMA_BUF_MEM_SIZE_ALIGNMENT)) {
+ dev_err_ratelimited(hdev->dev,
+ "size of exported device memory should be aligned to 0x%x, size 0x%llx\n",
+ DMA_BUF_MEM_SIZE_ALIGNMENT, size);
+ return -EINVAL;
+ }
+
+ if (device_addr < prop->dram_user_base_address ||
+ device_addr + size > prop->dram_end_address ||
+ device_addr + size < device_addr) {
+ dev_err_ratelimited(hdev->dev,
+ "DRAM memory range is outside of DRAM boundaries, address 0x%llx, size 0x%llx\n",
+ device_addr, size);
+ return -EINVAL;
+ }
+
+ bar_address = hdev->dram_pci_bar_start +
+ (device_addr - prop->dram_base_address);
+
+ if (bar_address + size >
+ hdev->dram_pci_bar_start + prop->dram_pci_bar_size ||
+ bar_address + size < bar_address) {
+ dev_err_ratelimited(hdev->dev,
+ "DRAM memory range is outside of PCI BAR boundaries, address 0x%llx, size 0x%llx\n",
+ device_addr, size);
+ return -EINVAL;
+ }
+
+ hl_dmabuf = kzalloc(sizeof(*hl_dmabuf), GFP_KERNEL);
+ if (!hl_dmabuf)
+ return -ENOMEM;
+
+ /* In case we got a large memory area to export, we need to divide it
+ * to smaller areas because each entry in the dmabuf sgt can only
+ * describe unsigned int.
+ */
+ if (size > DMA_BUF_CHUNK_MAX_SIZE) {
+ hl_dmabuf->page_size = DMA_BUF_MEM_SIZE_ALIGNMENT;
+ hl_dmabuf->npages = div_u64(size, hl_dmabuf->page_size);
+ } else {
+ hl_dmabuf->page_size = size;
+ hl_dmabuf->npages = 1;
+ }
+
+ hl_dmabuf->pages = kcalloc(hl_dmabuf->npages, sizeof(*hl_dmabuf->pages),
+ GFP_KERNEL);
+ if (!hl_dmabuf->pages) {
+ rc = -ENOMEM;
+ goto err_free_dmabuf_wrapper;
+ }
+
+ for (i = 0 ; i < hl_dmabuf->npages ; i++)
+ hl_dmabuf->pages[i] = device_addr +
+ i * hl_dmabuf->page_size;
+
+ rc = export_dmabuf_common(ctx, hl_dmabuf, size, flags, dmabuf_fd);
+ if (rc)
+ goto err_free_pages;
+
+ return 0;
+
+err_free_pages:
+ kfree(hl_dmabuf->pages);
+err_free_dmabuf_wrapper:
+ kfree(hl_dmabuf);
+ return rc;
+}
+
+/**
+ * export_dmabuf_from_handle() - export a dma-buf object for the given memory
+ * handle.
+ * @ctx: pointer to the context structure.
+ * @handle: device memory allocation handle.
+ * @flags: DMA-BUF file/FD flags.
+ * @dmabuf_fd: pointer to result FD that represents the dma-buf object.
+ *
+ * Create and export a dma-buf object for an existing memory allocation inside
+ * the device memory, and return a FD which is associated with the dma-buf
+ * object.
+ *
+ * Return: 0 on success, non-zero for failure.
+ */
+static int export_dmabuf_from_handle(struct hl_ctx *ctx, u64 handle, int flags,
+ int *dmabuf_fd)
+{
+ struct hl_vm_phys_pg_pack *phys_pg_pack;
+ struct hl_dmabuf_wrapper *hl_dmabuf;
+ struct hl_device *hdev = ctx->hdev;
+ struct asic_fixed_properties *prop;
+ struct hl_vm *vm = &hdev->vm;
+ u64 bar_address;
+ u32 idr_handle;
+ int rc, i;
+
+ prop = &hdev->asic_prop;
+
+ idr_handle = lower_32_bits(handle);
+
+ spin_lock(&vm->idr_lock);
+
+ phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, idr_handle);
+ if (!phys_pg_pack) {
+ spin_unlock(&vm->idr_lock);
+ dev_err_ratelimited(hdev->dev, "no match for handle 0x%x\n",
+ idr_handle);
+ return -EINVAL;
+ }
+
+ /* increment now to avoid freeing device memory while exporting */
+ phys_pg_pack->exporting_cnt++;
+
+ spin_unlock(&vm->idr_lock);
+
+ if (phys_pg_pack->vm_type != VM_TYPE_PHYS_PACK) {
+ dev_err_ratelimited(hdev->dev,
+ "handle 0x%llx is not for DRAM memory\n",
+ handle);
+ rc = -EINVAL;
+ goto err_dec_exporting_cnt;
+ }
+
+ for (i = 0 ; i < phys_pg_pack->npages ; i++) {
+
+ bar_address = hdev->dram_pci_bar_start +
+ (phys_pg_pack->pages[i] -
+ prop->dram_base_address);
+
+ if (bar_address + phys_pg_pack->page_size >
+ hdev->dram_pci_bar_start + prop->dram_pci_bar_size ||
+ bar_address + phys_pg_pack->page_size < bar_address) {
+
+ dev_err_ratelimited(hdev->dev,
+ "DRAM memory range is outside of PCI BAR boundaries, address 0x%llx, size 0x%x\n",
+ phys_pg_pack->pages[i],
+ phys_pg_pack->page_size);
+
+ rc = -EINVAL;
+ goto err_dec_exporting_cnt;
+ }
+ }
+
+ hl_dmabuf = kzalloc(sizeof(*hl_dmabuf), GFP_KERNEL);
+ if (!hl_dmabuf) {
+ rc = -ENOMEM;
+ goto err_dec_exporting_cnt;
+ }
+
+ hl_dmabuf->phys_pg_pack = phys_pg_pack;
+
+ rc = export_dmabuf_common(ctx, hl_dmabuf, phys_pg_pack->total_size,
+ flags, dmabuf_fd);
+ if (rc)
+ goto err_free_dmabuf_wrapper;
+
+ return 0;
+
+err_free_dmabuf_wrapper:
+ kfree(hl_dmabuf);
+
+err_dec_exporting_cnt:
+ spin_lock(&vm->idr_lock);
+ phys_pg_pack->exporting_cnt--;
+ spin_unlock(&vm->idr_lock);
+
+ return rc;
+}
+
static int mem_ioctl_no_mmu(struct hl_fpriv *hpriv, union hl_mem_args *args)
{
struct hl_device *hdev = hpriv->hdev;
struct hl_ctx *ctx = hpriv->ctx;
u64 block_handle, device_addr = 0;
u32 handle = 0, block_size;
- int rc;
+ int rc, dmabuf_fd = -EBADF;

switch (args->in.op) {
case HL_MEM_OP_ALLOC:
@@ -1559,6 +2005,16 @@ static int mem_ioctl_no_mmu(struct hl_fpriv *hpriv, union hl_mem_args *args)
args->out.block_size = block_size;
break;

+ case HL_MEM_OP_EXPORT_DMABUF_FD:
+ rc = export_dmabuf_from_addr(ctx,
+ args->in.export_dmabuf_fd.handle,
+ args->in.export_dmabuf_fd.mem_size,
+ args->in.flags,
+ &dmabuf_fd);
+ memset(args, 0, sizeof(*args));
+ args->out.fd = dmabuf_fd;
+ break;
+
default:
dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
rc = -ENOTTY;
@@ -1577,7 +2033,7 @@ int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data)
struct hl_ctx *ctx = hpriv->ctx;
u64 block_handle, device_addr = 0;
u32 handle = 0, block_size;
- int rc;
+ int rc, dmabuf_fd = -EBADF;

if (!hl_device_operational(hdev, &status)) {
dev_warn_ratelimited(hdev->dev,
@@ -1668,6 +2124,22 @@ int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data)
args->out.block_size = block_size;
break;

+ case HL_MEM_OP_EXPORT_DMABUF_FD:
+ if (hdev->asic_prop.dram_supports_virtual_memory)
+ rc = export_dmabuf_from_handle(ctx,
+ args->in.export_dmabuf_fd.handle,
+ args->in.flags,
+ &dmabuf_fd);
+ else
+ rc = export_dmabuf_from_addr(ctx,
+ args->in.export_dmabuf_fd.handle,
+ args->in.export_dmabuf_fd.mem_size,
+ args->in.flags,
+ &dmabuf_fd);
+ memset(args, 0, sizeof(*args));
+ args->out.fd = dmabuf_fd;
+ break;
+
default:
dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
rc = -ENOTTY;
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 7dd36d1cb39e..b2ac0e9f10cb 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -772,6 +772,7 @@ static int gaudi_early_init(struct hl_device *hdev)
}

prop->dram_pci_bar_size = pci_resource_len(pdev, HBM_BAR_ID);
+ hdev->dram_pci_bar_start = pci_resource_start(pdev, HBM_BAR_ID);

/* If FW security is enabled at this point it means no access to ELBI */
if (hdev->asic_prop.fw_security_enabled) {
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 4144a8445eef..522f66146458 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -619,6 +619,7 @@ static int goya_early_init(struct hl_device *hdev)
}

prop->dram_pci_bar_size = pci_resource_len(pdev, DDR_BAR_ID);
+ hdev->dram_pci_bar_start = pci_resource_start(pdev, DDR_BAR_ID);

/* If FW security is enabled at this point it means no access to ELBI */
if (hdev->asic_prop.fw_security_enabled) {
--
2.25.1

2021-07-05 16:55:09

by Jason Gunthorpe

Subject: Re: [PATCH v4 2/2] habanalabs: add support for dma-buf exporter

On Mon, Jul 05, 2021 at 04:03:14PM +0300, Oded Gabbay wrote:

> + rc = sg_alloc_table(*sgt, nents, GFP_KERNEL | __GFP_ZERO);
> + if (rc)
> + goto error_free;

If you are not going to include a CPU list then I suggest setting
sg_table->orig_nents == 0

And using only the nents which is the length of the DMA list.

At least it gives some hope that other parts of the system could
detect this.

> +
> + /* Merge pages and put them into the scatterlist */
> + cur_page = 0;
> + for_each_sgtable_sg((*sgt), sg, i) {

for_each_sgtable_sg should never be used when working with
sg_dma_address() type stuff, here and everywhere else. The DMA list
should be iterated using the for_each_sgtable_dma_sg() macro.
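
For example, the unmap side would become something like this (untested,
just to show the macro; everything else is taken from the patch):

static void hl_unmap_dmabuf(struct dma_buf_attachment *attachment,
                            struct sg_table *sgt,
                            enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        /* Walk only the DMA-mapped entries of the table */
        for_each_sgtable_dma_sg(sgt, sg, i)
                dma_unmap_resource(attachment->dev, sg_dma_address(sg),
                                   sg_dma_len(sg), dir,
                                   DMA_ATTR_SKIP_CPU_SYNC);

        sg_free_table(sgt);
        kfree(sgt);
}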

> + /* In case we got a large memory area to export, we need to divide it
> + * to smaller areas because each entry in the dmabuf sgt can only
> + * describe unsigned int.
> + */

Huh? This is forming a SGL, it should follow the SGL rules which means
you have to fragment based on the dma_get_max_seg_size() of the
importer device.
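
i.e. when sizing the chunks in the map callback, something like this
(untested sketch) instead of a driver-invented DMA_BUF_CHUNK_MAX_SIZE:

        /*
         * Limit each SGL entry to what the importer can handle. The
         * return value is an unsigned int, so it also fits the
         * sg_dma_len() limit the driver comment above worries about.
         */
        u64 max_chunk = dma_get_max_seg_size(attachment->dev);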

> + hl_dmabuf->pages = kcalloc(hl_dmabuf->npages, sizeof(*hl_dmabuf->pages),
> + GFP_KERNEL);
> + if (!hl_dmabuf->pages) {
> + rc = -ENOMEM;
> + goto err_free_dmabuf_wrapper;
> + }

Why not just create the SGL directly? Is there a reason it needs to
make a page list?
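
For the export-from-address case the map callback could just walk the
range directly, roughly like the below (untested sketch, not a drop-in;
hl_fill_sgt_from_range() is a made-up name, the hdev fields are the
ones used elsewhere in the patch, and unwinding of already-mapped
entries on error is omitted):

static int hl_fill_sgt_from_range(struct hl_device *hdev,
                                  struct sg_table *sgt, u64 device_addr,
                                  u64 size, u64 max_chunk,
                                  struct device *dev,
                                  enum dma_data_direction dir)
{
        u64 offset = 0, chunk;
        struct scatterlist *sg;
        dma_addr_t addr;
        int i;

        /*
         * sgt must have been allocated with DIV_ROUND_UP(size, max_chunk)
         * entries; each iteration maps one BAR-backed chunk.
         */
        for_each_sgtable_dma_sg(sgt, sg, i) {
                chunk = min(size - offset, max_chunk);

                addr = dma_map_resource(dev,
                                        hdev->dram_pci_bar_start +
                                        (device_addr + offset -
                                         hdev->asic_prop.dram_base_address),
                                        chunk, dir, DMA_ATTR_SKIP_CPU_SYNC);
                if (dma_mapping_error(dev, addr))
                        return -ENOMEM;

                sg_dma_address(sg) = addr;
                sg_dma_len(sg) = chunk;
                offset += chunk;
        }

        return 0;
}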

Jason

2021-07-06 08:41:23

by Daniel Vetter

Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> Hi,
> I'm sending v4 of this patch-set following the long email thread.
> I want to thank Jason for reviewing v3 and pointing out the errors, saving
> us time later to debug it :)
>
> I consulted with Christian on how to fix patch 2 (the implementation) and
> at the end of the day I shamelessly copied the relevant content from
> amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
>
> I also made a few improvements after looking at the relevant code in amdgpu.
> The details are in the changelog of patch 2.
>
> I took the time to write an import code into the driver, allowing me to
> check real P2P with two Gaudi devices, one as exporter and the other as
> importer. I'm not going to include the import code in the product, it was
> just for testing purposes (although I can share it if anyone wants).
>
> I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
>
> Greg, I hope this will be good enough for you to merge this code.

So we're officially going to use dri-devel for technical details review
and then Greg for merging so we don't have to deal with other merge
criteria dri-devel folks have?

I don't expect anything less by now, but it does make the original claim
that drivers/misc will not step all over accelerators folks a complete
farce under the totally-not-a-gpu banner.

This essentially means that for any other accelerator stack that doesn't
fit the dri-devel merge criteria, even if it's acting like a gpu and uses
other gpu driver stuff, you can just send it to Greg and it's good to go.

There's quite a lot of these floating around actually (and many do have
semi-open runtimes, like habanalabs have now too, just not open enough to
be actually useful). It's going to be absolutely lovely having to explain
to these companies in background chats why habanalabs gets away with their
stack and they don't.

Or maybe we should just merge them all and give up on the idea of having
open cross-vendor driver stacks for these accelerators.

Thanks, Daniel

>
> Thanks,
> Oded
>
> Oded Gabbay (1):
> habanalabs: define uAPI to export FD for DMA-BUF
>
> Tomer Tayar (1):
> habanalabs: add support for dma-buf exporter
>
> drivers/misc/habanalabs/Kconfig | 1 +
> drivers/misc/habanalabs/common/habanalabs.h | 26 ++
> drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
> drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
> drivers/misc/habanalabs/goya/goya.c | 1 +
> include/uapi/misc/habanalabs.h | 28 +-
> 6 files changed, 532 insertions(+), 5 deletions(-)
>
> --
> 2.25.1
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 09:49:25

by Oded Gabbay

Subject: Re: [PATCH v4 2/2] habanalabs: add support for dma-buf exporter

On Mon, Jul 5, 2021 at 7:52 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Jul 05, 2021 at 04:03:14PM +0300, Oded Gabbay wrote:
>
> > + rc = sg_alloc_table(*sgt, nents, GFP_KERNEL | __GFP_ZERO);
> > + if (rc)
> > + goto error_free;
>
> If you are not going to include a CPU list then I suggest setting
> sg_table->orig_nents == 0
>
> And using only the nents which is the length of the DMA list.
>
> At least it gives some hope that other parts of the system could
> detect this.

Will do.
>
> > +
> > + /* Merge pages and put them into the scatterlist */
> > + cur_page = 0;
> > + for_each_sgtable_sg((*sgt), sg, i) {
>
> for_each_sgtable_sg should never be used when working with
> sg_dma_address() type stuff, here and everywhere else. The DMA list
> should be iterated using the for_each_sgtable_dma_sg() macro.

Thanks, will change that.

>
> > + /* In case we got a large memory area to export, we need to divide it
> > + * to smaller areas because each entry in the dmabuf sgt can only
> > + * describe unsigned int.
> > + */
>
> Huh? This is forming a SGL, it should follow the SGL rules which means
> you have to fragment based on the dma_get_max_seg_size() of the
> importer device.
>
Hmm, I don't see anyone in drm checking this value (and using it) when
creating the SGL for an exported dmabuf (e.g.
amdgpu_vram_mgr_alloc_sgt).
Are you sure we need this check? Maybe I'm mistaken, but if that's a
real concern, then most of the drm drivers are broken.

> > + hl_dmabuf->pages = kcalloc(hl_dmabuf->npages, sizeof(*hl_dmabuf->pages),
> > + GFP_KERNEL);
> > + if (!hl_dmabuf->pages) {
> > + rc = -ENOMEM;
> > + goto err_free_dmabuf_wrapper;
> > + }
>
> Why not just create the SGL directly? Is there a reason it needs to
> make a page list?
Well, the idea behind it was that we have two points of entry to this
code path (export dmabuf from address, and export dmabuf from handle),
and we want the map function later on to be agnostic to that and
behave the same in both cases.

Having said that, if we need to consider the max segment size
according to dma_get_max_seg_size() when creating the SGL, I agree
this code is redundant and I will remove it.

>
> Jason

2021-07-06 10:05:11

by Oded Gabbay

Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <[email protected]> wrote:
>
> On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > Hi,
> > I'm sending v4 of this patch-set following the long email thread.
> > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > us time later to debug it :)
> >
> > I consulted with Christian on how to fix patch 2 (the implementation) and
> > at the end of the day I shamelessly copied the relevant content from
> > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> >
> > I also made a few improvements after looking at the relevant code in amdgpu.
> > The details are in the changelog of patch 2.
> >
> > I took the time to write an import code into the driver, allowing me to
> > check real P2P with two Gaudi devices, one as exporter and the other as
> > importer. I'm not going to include the import code in the product, it was
> > just for testing purposes (although I can share it if anyone wants).
> >
> > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> >
> > Greg, I hope this will be good enough for you to merge this code.
>
> So we're officially going to use dri-devel for technical details review
> and then Greg for merging so we don't have to deal with other merge
> criteria dri-devel folks have?
I'm glad to receive any help or review, regardless of the subsystem
the person giving that help belongs to.

>
> I don't expect anything less by now, but it does make the original claim
> that drivers/misc will not step all over accelerators folks a complete
> farce under the totally-not-a-gpu banner.
>
> This essentially means that for any other accelerator stack that doesn't
> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> other gpu driver stuff, you can just send it to Greg and it's good to go.

What's wrong with Greg ??? ;)

On a more serious note, yes, I do think the dri-devel merge criteria
is very extreme, and effectively drives out many AI accelerator
companies that want to contribute to the kernel but can't/won't open
their software IP and patents.

I think the expectation from AI startups (who are 90% of the deep
learning field) to cooperate outside of company boundaries is not
realistic, especially on the user-side, where the real IP of the
company resides.

Personally I don't think there is a real justification for that at
this point of time, but if it will make you (and other people here)
happy I really don't mind creating a non-gpu accelerator subsystem
that will contain all the totally-not-a-gpu accelerators, and will
have a more relaxed criteria for upstreaming. Something along an
"rdma-core" style library looks like the correct amount of user-level
open source that should be enough.

The question is, what will happen later ? Will it be sufficient to
"allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
hmm) ?

If the community and dri-devel maintainers (and you among them) will
assure me it is good enough, then I'll happily contribute my work and
personal time to organize this effort and implement it.

Thanks,
oded

>
> There's quite a lot of these floating around actually (and many do have
> semi-open runtimes, like habanalabs have now too, just not open enough to
> be actually useful). It's going to be absolutely lovely having to explain
> to these companies in background chats why habanalabs gets away with their
> stack and they don't.
>
> Or maybe we should just merge them all and give up on the idea of having
> open cross-vendor driver stacks for these accelerators.
>
> Thanks, Daniel
>
> >
> > Thanks,
> > Oded
> >
> > Oded Gabbay (1):
> > habanalabs: define uAPI to export FD for DMA-BUF
> >
> > Tomer Tayar (1):
> > habanalabs: add support for dma-buf exporter
> >
> > drivers/misc/habanalabs/Kconfig | 1 +
> > drivers/misc/habanalabs/common/habanalabs.h | 26 ++
> > drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
> > drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
> > drivers/misc/habanalabs/goya/goya.c | 1 +
> > include/uapi/misc/habanalabs.h | 28 +-
> > 6 files changed, 532 insertions(+), 5 deletions(-)
> >
> > --
> > 2.25.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

2021-07-06 10:38:28

by Daniel Vetter

Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <[email protected]> wrote:
>
> On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <[email protected]> wrote:
> >
> > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > Hi,
> > > I'm sending v4 of this patch-set following the long email thread.
> > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > us time later to debug it :)
> > >
> > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > at the end of the day I shamelessly copied the relevant content from
> > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > >
> > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > The details are in the changelog of patch 2.
> > >
> > > I took the time to write an import code into the driver, allowing me to
> > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > importer. I'm not going to include the import code in the product, it was
> > > just for testing purposes (although I can share it if anyone wants).
> > >
> > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > >
> > > Greg, I hope this will be good enough for you to merge this code.
> >
> > So we're officially going to use dri-devel for technical details review
> > and then Greg for merging so we don't have to deal with other merge
> > criteria dri-devel folks have?
> I'm glad to receive any help or review, regardless of the subsystem
> the person giving that help belongs to.
>
> >
> > I don't expect anything less by now, but it does make the original claim
> > that drivers/misc will not step all over accelerators folks a complete
> > farce under the totally-not-a-gpu banner.
> >
> > This essentially means that for any other accelerator stack that doesn't
> > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > other gpu driver stuff, you can just send it to Greg and it's good to go.
>
> What's wrong with Greg ??? ;)
>
> On a more serious note, yes, I do think the dri-devel merge criteria
> is very extreme, and effectively drives-out many AI accelerator
> companies that want to contribute to the kernel but can't/won't open
> their software IP and patents.
>
> I think the expectation from AI startups (who are 90% of the deep
> learning field) to cooperate outside of company boundaries is not
> realistic, especially on the user-side, where the real IP of the
> company resides.
>
> Personally I don't think there is a real justification for that at
> this point of time, but if it will make you (and other people here)
> happy I really don't mind creating a non-gpu accelerator subsystem
> that will contain all the totally-not-a-gpu accelerators, and will
> have a more relaxed criteria for upstreaming. Something along an
> "rdma-core" style library looks like the correct amount of user-level
> open source that should be enough.
>
> The question is, what will happen later ? Will it be sufficient to
> "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> hmm) ?
>
> If the community and dri-devel maintainers (and you among them) will
> assure me it is good enough, then I'll happily contribute my work and
> personal time to organize this effort and implement it.

I think dri-devel stance is pretty clear and well known: We want the
userspace to be open, because that's where most of the driver stack
is. Without an open driver stack there's no way to ever have anything
cross-vendor.

And that includes the compiler and anything else you need to drive the hardware.

Afaik linux cpu arch ports are also not accepted if there's no open
gcc or llvm port around, because without that the overall stack just
becomes useless.

If that means AI companies don't want to open up their hw specs
enough to allow that, so be it - all you get in that case is
offloading the kernel side of the stack for convenience, with zero
long term prospects to ever make this into a cross vendor subsystem
stack that does something useful. If the business case says you can't
open up your hw enough for that, I really don't see the point in
merging such a driver, it'll be an unmaintainable stack by anyone else
who's not having access to those NDA covered specs and patents and
everything.

If the stack is actually cross vendor to begin with that's just bonus,
but generally that doesn't happen voluntarily and needs a few years to
decades to get there. So that's not really something we require.

tldr; just a runtime isn't enough for dri-devel.

Now Greg seems to be happy to merge kernel drivers that aren't useful
with the open bits provided, so *shrug*.

Cheers, Daniel

PS: If requiring an actually useful open driver stack is somehow
*extreme* I have no idea why we even bother with merging device
drivers to upstream. Just make a stable driver api and done, vendors
can then do whatever they feel like and protect their "valuable IP and
patents" or whatever it is.

> Thanks,
> oded
>
> >
> > There's quite a lot of these floating around actually (and many do have
> > semi-open runtimes, like habanalabs have now too, just not open enough to
> > be actually useful). It's going to be absolutely lovely having to explain
> > to these companies in background chats why habanalabs gets away with their
> > stack and they don't.
> >
> > Or maybe we should just merge them all and give up on the idea of having
> > open cross-vendor driver stacks for these accelerators.
> >
> > Thanks, Daniel
> >
> > >
> > > Thanks,
> > > Oded
> > >
> > > Oded Gabbay (1):
> > > habanalabs: define uAPI to export FD for DMA-BUF
> > >
> > > Tomer Tayar (1):
> > > habanalabs: add support for dma-buf exporter
> > >
> > > drivers/misc/habanalabs/Kconfig | 1 +
> > > drivers/misc/habanalabs/common/habanalabs.h | 26 ++
> > > drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
> > > drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
> > > drivers/misc/habanalabs/goya/goya.c | 1 +
> > > include/uapi/misc/habanalabs.h | 28 +-
> > > 6 files changed, 532 insertions(+), 5 deletions(-)
> > >
> > > --
> > > 2.25.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 10:50:44

by Daniel Vetter

Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 12:36 PM Daniel Vetter <[email protected]> wrote:
> On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <[email protected]> wrote:
> >
> > On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <[email protected]> wrote:
> > >
> > > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > > Hi,
> > > > I'm sending v4 of this patch-set following the long email thread.
> > > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > > us time later to debug it :)
> > > >
> > > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > > at the end of the day I shamelessly copied the relevant content from
> > > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > > >
> > > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > > The details are in the changelog of patch 2.
> > > >
> > > > I took the time to write an import code into the driver, allowing me to
> > > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > > importer. I'm not going to include the import code in the product, it was
> > > > just for testing purposes (although I can share it if anyone wants).
> > > >
> > > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > > >
> > > > Greg, I hope this will be good enough for you to merge this code.
> > >
> > > So we're officially going to use dri-devel for technical details review
> > > and then Greg for merging so we don't have to deal with other merge
> > > criteria dri-devel folks have?
> > I'm glad to receive any help or review, regardless of the subsystem
> > the person giving that help belongs to.
> >
> > >
> > > I don't expect anything less by now, but it does make the original claim
> > > that drivers/misc will not step all over accelerators folks a complete
> > > farce under the totally-not-a-gpu banner.
> > >
> > > This essentially means that for any other accelerator stack that doesn't
> > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> >
> > What's wrong with Greg ??? ;)
> >
> > On a more serious note, yes, I do think the dri-devel merge criteria
> > is very extreme, and effectively drives-out many AI accelerator
> > companies that want to contribute to the kernel but can't/won't open
> > their software IP and patents.
> >
> > I think the expectation from AI startups (who are 90% of the deep
> > learning field) to cooperate outside of company boundaries is not
> > realistic, especially on the user-side, where the real IP of the
> > company resides.
> >
> > Personally I don't think there is a real justification for that at
> > this point of time, but if it will make you (and other people here)
> > happy I really don't mind creating a non-gpu accelerator subsystem
> > that will contain all the totally-not-a-gpu accelerators, and will
> > have a more relaxed criteria for upstreaming. Something along an
> > "rdma-core" style library looks like the correct amount of user-level
> > open source that should be enough.
> >
> > The question is, what will happen later ? Will it be sufficient to
> > "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> > hmm) ?
> >
> > If the community and dri-devel maintainers (and you among them) will
> > assure me it is good enough, then I'll happily contribute my work and
> > personal time to organize this effort and implement it.
>
> I think dri-devel stance is pretty clear and well known: We want the
> userspace to be open, because that's where most of the driver stack
> is. Without an open driver stack there's no way to ever have anything
> cross-vendor.
>
> And that includes the compiler and anything else you need to drive the hardware.
>
> Afaik linux cpu arch ports are also not accepted if there's no open
> gcc or llvm port around, because without that the overall stack just
> becomes useless.
>
> If that means AI companies don't want to open our their hw specs
> enough to allow that, so be it - all you get in that case is
> offloading the kernel side of the stack for convenience, with zero
> long term prospects to ever make this into a cross vendor subsystem
> stack that does something useful. If the business case says you can't
> open up your hw enough for that, I really don't see the point in
> merging such a driver, it'll be an unmaintainable stack by anyone else
> who's not having access to those NDA covered specs and patents and
> everything.
>
> If the stack is actually cross vendor to begin with that's just bonus,
> but generally that doesn't happen voluntarily and needs a few years to
> decades to get there. So that's not really something we require.
>
> tldr; just a runtime isn't enough for dri-devel.
>
> Now Greg seems to be happy to merge kernel drivers that aren't useful
> with the open bits provided, so *shrug*.
>
> Cheers, Daniel
>
> PS: If requiring an actually useful open driver stack is somehow
> *extreme* I have no idea why we even bother with merging device
> drivers to upstream. Just make a stable driver api and done, vendors
> can then do whatever they feel like and protect their "valuable IP and
> patents" or whatever it is.

So perhaps this isn't clear, so let's explain this differently.

The deal when having a driver in upstream is that both the vendor and
upstream benefits:
- vendor gets their driver carried and adjusted in upstream, because
there's no stable uapi, and the benefit of being included everywhere
by default
- upstream gets the benefit to be able to hack around in more drivers,
which generally leads to a more robust subsystem and driver
architecture

Now what you want is to have the benefits for you, without giving the
wider community the benefit of actually being able to hack on your
driver stack. Because you prefer to keep critical pieces of it
protected and closed, which makes sure no one can create a new
cross-vendor stack without your permission. Or without investing a lot
of time into reverse-engineering the hardware. That's not extreme,
that's just preferring to have your cake and eat it too.

And frankly on dri-devel we don't take such a lopsided deal. Greg
otoh seems to be totally fine, or not really understand what it takes
to build an accelerator stack, or I dunno what, but he's happy merging
them.

Cheers, Daniel


> > Thanks,
> > oded
> >
> > >
> > > There's quite a lot of these floating around actually (and many do have
> > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > be actually useful). It's going to be absolutely lovely having to explain
> > > to these companies in background chats why habanalabs gets away with their
> > > stack and they don't.
> > >
> > > Or maybe we should just merge them all and give up on the idea of having
> > > open cross-vendor driver stacks for these accelerators.
> > >
> > > Thanks, Daniel
> > >
> > > >
> > > > Thanks,
> > > > Oded
> > > >
> > > > Oded Gabbay (1):
> > > > habanalabs: define uAPI to export FD for DMA-BUF
> > > >
> > > > Tomer Tayar (1):
> > > > habanalabs: add support for dma-buf exporter
> > > >
> > > > drivers/misc/habanalabs/Kconfig | 1 +
> > > > drivers/misc/habanalabs/common/habanalabs.h | 26 ++
> > > > drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
> > > > drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
> > > > drivers/misc/habanalabs/goya/goya.c | 1 +
> > > > include/uapi/misc/habanalabs.h | 28 +-
> > > > 6 files changed, 532 insertions(+), 5 deletions(-)
> > > >
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 12:26:53

by Daniel Vetter

Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 12:47 PM Daniel Vetter <[email protected]> wrote:
> On Tue, Jul 6, 2021 at 12:36 PM Daniel Vetter <[email protected]> wrote:
> > On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <[email protected]> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <[email protected]> wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > > > Hi,
> > > > > I'm sending v4 of this patch-set following the long email thread.
> > > > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > > > us time later to debug it :)
> > > > >
> > > > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > > > at the end of the day I shamelessly copied the relevant content from
> > > > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > > > >
> > > > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > > > The details are in the changelog of patch 2.
> > > > >
> > > > > I took the time to write an import code into the driver, allowing me to
> > > > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > > > importer. I'm not going to include the import code in the product, it was
> > > > > just for testing purposes (although I can share it if anyone wants).
> > > > >
> > > > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > > > >
> > > > > Greg, I hope this will be good enough for you to merge this code.
> > > >
> > > > So we're officially going to use dri-devel for technical details review
> > > > and then Greg for merging so we don't have to deal with other merge
> > > > criteria dri-devel folks have?
> > > I'm glad to receive any help or review, regardless of the subsystem
> > > the person giving that help belongs to.
> > >
> > > >
> > > > I don't expect anything less by now, but it does make the original claim
> > > > that drivers/misc will not step all over accelerators folks a complete
> > > > farce under the totally-not-a-gpu banner.
> > > >
> > > > This essentially means that for any other accelerator stack that doesn't
> > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > >
> > > What's wrong with Greg ??? ;)
> > >
> > > On a more serious note, yes, I do think the dri-devel merge criteria
> > > is very extreme, and effectively drives-out many AI accelerator
> > > companies that want to contribute to the kernel but can't/won't open
> > > their software IP and patents.
> > >
> > > I think the expectation from AI startups (who are 90% of the deep
> > > learning field) to cooperate outside of company boundaries is not
> > > realistic, especially on the user-side, where the real IP of the
> > > company resides.
> > >
> > > Personally I don't think there is a real justification for that at
> > > this point of time, but if it will make you (and other people here)
> > > happy I really don't mind creating a non-gpu accelerator subsystem
> > > that will contain all the totally-not-a-gpu accelerators, and will
> > > have a more relaxed criteria for upstreaming. Something along an
> > > "rdma-core" style library looks like the correct amount of user-level
> > > open source that should be enough.

On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
cores in their hw, for which you'd need some kind of compiler to make
use of the hardware and the interfaces the kernel provides? So not
really compareable, but also my understanding is that rdma-core does
actually allow you to reasonable use&drive all the hw features and
kernel interfaces fully.

So we actually want less on dri-devel, because for compute/accel chips
we're currently happy with a vendor userspace. It just needs to be
functional and complete, and open in its entirety.

Now if there's going to be an AI/NN/spatial compute core runtime with
all the things included that's cross-vendor that's obviously going to
be great, but that's strictly a bonus. And eventually the long-term
goal, once we have a few open stacks from various vendors. But atm we
have 0 open stacks, so one thing at a time.

> > > The question is, what will happen later ? Will it be sufficient to
> > > "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> > > hmm) ?
> > >
> > > If the community and dri-devel maintainers (and you among them) will
> > > assure me it is good enough, then I'll happily contribute my work and
> > > personal time to organize this effort and implement it.
> >
> > I think dri-devel stance is pretty clear and well known: We want the
> > userspace to be open, because that's where most of the driver stack
> > is. Without an open driver stack there's no way to ever have anything
> > cross-vendor.
> >
> > And that includes the compiler and anything else you need to drive the hardware.
> >
> > Afaik linux cpu arch ports are also not accepted if there's no open
> > gcc or llvm port around, because without that the overall stack just
> > becomes useless.
> >
> > If that means AI companies don't want to open our their hw specs
> > enough to allow that, so be it - all you get in that case is
> > offloading the kernel side of the stack for convenience, with zero
> > long term prospects to ever make this into a cross vendor subsystem
> > stack that does something useful. If the business case says you can't
> > open up your hw enough for that, I really don't see the point in
> > merging such a driver, it'll be an unmaintainable stack by anyone else
> > who's not having access to those NDA covered specs and patents and
> > everything.
> >
> > If the stack is actually cross vendor to begin with that's just bonus,
> > but generally that doesn't happen voluntarily and needs a few years to
> > decades to get there. So that's not really something we require.
> >
> > tldr; just a runtime isn't enough for dri-devel.
> >
> > Now Greg seems to be happy to merge kernel drivers that aren't useful
> > with the open bits provided, so *shrug*.
> >
> > Cheers, Daniel
> >
> > PS: If requiring an actually useful open driver stack is somehow
> > *extreme* I have no idea why we even bother with merging device
> > drivers to upstream. Just make a stable driver api and done, vendors
> > can then do whatever they feel like and protect their "valuable IP and
> > patents" or whatever it is.
>
> So perhaps this isn't clear, so let's explain this differently.
>
> The deal when having a driver in upstream is that both the vendor and
> upstream benefits:
> - vendor gets their driver carried and adjusted in upstream, because
> there's no stable uapi, and the benefit of being included everywhere

s/uapi/kernel driver api/ ofc, but I got it right in the first reply at least.
-Daniel

> by default
> - upstream gets the benefit to be able to hack around in more drivers,
> which generally leads to a more robust subsystem and driver
> architecture
>
> Now what you want is to have the benefits for you, without giving the
> wider community the benefit of actually being able to hack on your
> driver stack. Because you prefer to keep critical pieces of it
> protected and closed, which makes sure no one can create a new
> cross-vendor stack without your permission. Or without investing a lot
> of time into reverse-engineering the hardware. That's not extreme,
> that's just preferring to have your cake and eat it too.
>
> And frankly on dri-devel we don't take such a loopsided deal. Greg
> otoh seems to be totally fine, or not really understand what it takes
> to build an accelerator stack, or I dunno what, but he's happy merging
> them.
>
> Cheers, Daniel
>
>
> > > Thanks,
> > > oded
> > >
> > > >
> > > > There's quite a lot of these floating around actually (and many do have
> > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > to these companies in background chats why habanalabs gets away with their
> > > > stack and they don't.
> > > >
> > > > Or maybe we should just merge them all and give up on the idea of having
> > > > open cross-vendor driver stacks for these accelerators.
> > > >
> > > > Thanks, Daniel
> > > >
> > > > >
> > > > > Thanks,
> > > > > Oded
> > > > >
> > > > > Oded Gabbay (1):
> > > > > habanalabs: define uAPI to export FD for DMA-BUF
> > > > >
> > > > > Tomer Tayar (1):
> > > > > habanalabs: add support for dma-buf exporter
> > > > >
> > > > > drivers/misc/habanalabs/Kconfig | 1 +
> > > > > drivers/misc/habanalabs/common/habanalabs.h | 26 ++
> > > > > drivers/misc/habanalabs/common/memory.c | 480 +++++++++++++++++++-
> > > > > drivers/misc/habanalabs/gaudi/gaudi.c | 1 +
> > > > > drivers/misc/habanalabs/goya/goya.c | 1 +
> > > > > include/uapi/misc/habanalabs.h | 28 +-
> > > > > 6 files changed, 532 insertions(+), 5 deletions(-)
> > > > >
> > > > > --
> > > > > 2.25.1
> > > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 12:29:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> Afaik linux cpu arch ports are also not accepted if there's no open
> gcc or llvm port around, because without that the overall stack just
> becomes useless.

Yes. And the one architecture that has an open but not upstream
compiler already is more than enough of a pain to not repeat that
mistake ever again.

2021-07-06 12:31:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > Greg, I hope this will be good enough for you to merge this code.
>
> So we're officially going to use dri-devel for technical details review
> and then Greg for merging so we don't have to deal with other merge
> criteria dri-devel folks have?
>
> I don't expect anything less by now, but it does make the original claim
> that drivers/misc will not step all over accelerators folks a complete
> farce under the totally-not-a-gpu banner.
>
> This essentially means that for any other accelerator stack that doesn't
> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> other gpu driver stuff, you can just send it to Greg and it's good to go.
>
> There's quite a lot of these floating around actually (and many do have
> semi-open runtimes, like habanalabs have now too, just not open enough to
> be actually useful). It's going to be absolutely lovely having to explain
> to these companies in background chats why habanalabs gets away with their
> stack and they don't.

FYI, I fully agree with Daniel here. Habanalabs needs to open up their
runtime if they want to push any additional features into the kernel.
The current situation is not sustainable.

2021-07-06 12:46:42

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > Greg, I hope this will be good enough for you to merge this code.
> >
> > So we're officially going to use dri-devel for technical details review
> > and then Greg for merging so we don't have to deal with other merge
> > criteria dri-devel folks have?
> >
> > I don't expect anything less by now, but it does make the original claim
> > that drivers/misc will not step all over accelerators folks a complete
> > farce under the totally-not-a-gpu banner.
> >
> > This essentially means that for any other accelerator stack that doesn't
> > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > other gpu driver stuff, you can just send it to Greg and it's good to go.
> >
> > There's quite a lot of these floating around actually (and many do have
> > semi-open runtimes, like habanalabs have now too, just not open enough to
> > be actually useful). It's going to be absolutely lovely having to explain
> > to these companies in background chats why habanalabs gets away with their
> > stack and they don't.
>
> FYI, I fully agree with Daniel here. Habanlabs needs to open up their
> runtime if they want to push any additional feature in the kernel.
> The current situation is not sustainable.

Before anyone replies: The runtime is open, the compiler is still closed.
This has become the new default for accel driver submissions, I think
mostly because all the interesting bits for non-3d accelerators are in the
accel ISA, and no longer in the runtime. So vendors are fairly happy to
throw in the runtime as a freebie.

It's still incomplete, and it's still useless if you want to actually hack
on the driver stack.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 12:49:01

by Oded Gabbay

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <[email protected]> wrote:
>
> On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > Greg, I hope this will be good enough for you to merge this code.
> > >
> > > So we're officially going to use dri-devel for technical details review
> > > and then Greg for merging so we don't have to deal with other merge
> > > criteria dri-devel folks have?
> > >
> > > I don't expect anything less by now, but it does make the original claim
> > > that drivers/misc will not step all over accelerators folks a complete
> > > farce under the totally-not-a-gpu banner.
> > >
> > > This essentially means that for any other accelerator stack that doesn't
> > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > >
> > > There's quite a lot of these floating around actually (and many do have
> > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > be actually useful). It's going to be absolutely lovely having to explain
> > > to these companies in background chats why habanalabs gets away with their
> > > stack and they don't.
> >
> > FYI, I fully agree with Daniel here. Habanlabs needs to open up their
> > runtime if they want to push any additional feature in the kernel.
> > The current situation is not sustainable.
Well, that's like, your opinion...

>
> Before anyone replies: The runtime is open, the compiler is still closed.
> This has become the new default for accel driver submissions, I think
> mostly because all the interesting bits for non-3d accelerators are in the
> accel ISA, and no longer in the runtime. So vendors are fairly happy to
> throw in the runtime as a freebie.
>
> It's still incomplete, and it's still useless if you want to actually hack
> on the driver stack.
> -Daniel
> --
I don't understand what's not sustainable here.

There is zero code inside the driver that communicates or interacts
with our TPC code (TPC is the Tensor Processing Core).
Even submitting work to the TPC is done via a generic queue
interface. And that queue IP is common between all our engines
(TPC/DMA/NIC). The driver provides all the specs of that queue IP,
because the driver's code is handling that queue. But why is the TPC
compiler code even relevant here?

btw, you can see our TPC code today at
https://github.com/HabanaAI/Habana_Custom_Kernel
There is a link there to the TPC user guide and a link to download the
LLVM compiler.

Oded

2021-07-06 13:19:34

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 2:46 PM Oded Gabbay <[email protected]> wrote:
>
> On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <[email protected]> wrote:
> >
> > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > > Greg, I hope this will be good enough for you to merge this code.
> > > >
> > > > So we're officially going to use dri-devel for technical details review
> > > > and then Greg for merging so we don't have to deal with other merge
> > > > criteria dri-devel folks have?
> > > >
> > > > I don't expect anything less by now, but it does make the original claim
> > > > that drivers/misc will not step all over accelerators folks a complete
> > > > farce under the totally-not-a-gpu banner.
> > > >
> > > > This essentially means that for any other accelerator stack that doesn't
> > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > > >
> > > > There's quite a lot of these floating around actually (and many do have
> > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > to these companies in background chats why habanalabs gets away with their
> > > > stack and they don't.
> > >
> > > FYI, I fully agree with Daniel here. Habanlabs needs to open up their
> > > runtime if they want to push any additional feature in the kernel.
> > > The current situation is not sustainable.
> Well, that's like, your opinion...
>
> >
> > Before anyone replies: The runtime is open, the compiler is still closed.
> > This has become the new default for accel driver submissions, I think
> > mostly because all the interesting bits for non-3d accelerators are in the
> > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > throw in the runtime as a freebie.
> >
> > It's still incomplete, and it's still useless if you want to actually hack
> > on the driver stack.
> > -Daniel
> > --
> I don't understand what's not sustainable here.
>
> There is zero code inside the driver that communicates or interacts
> with our TPC code (TPC is the Tensor Processing Core).
> Even submitting works to the TPC is done via a generic queue
> interface. And that queue IP is common between all our engines
> (TPC/DMA/NIC). The driver provides all the specs of that queue IP,
> because the driver's code is handling that queue. But why is the TPC
> compiler code even relevant here ?

Can I use the hw how it's intended to be used without it?

If the answer is no, then essentially what you're doing with your
upstream driver is getting all the benefits of an upstream driver,
while upstream gets nothing. We can't use your stack, not as-is. Sure
we can use the queue, but we can't actually submit anything
interesting. And I'm pretty sure the point of your hw is to do more
than submit no-op packets to a queue.

This is all "I want my cake and eat it too" approach to upstreaming,
and it's totally fine attitude to have, but if you don't see why
there's maybe an different side to it then I don't get what you're
arguing. Upstream isn't free lunch for nothing.

Frankly I'm starting to assume you're arguing this all in bad faith
just because habanalabs doesn't want to actually have an open driver
stack, so any attack is good, no matter what. Which is also what
everyone else does who submits their accel driver to upstream, and
which gets us back to the starting point of this sub-thread of me
really appreciating how this will improve background discussions going
forward for everyone.

Like if the requirement for accel drivers truly is that you can submit
a dummy command to the queues then I have at least about 5-10 drivers
I could merge instantly. For something like the intel gpu driver it
would be about 50 lines of code (including all the structure
boilerplate the ioctls require) in userspace to submit a dummy queue
command. GPU and accel vendors would really love that, because it
would allow them to freeload on upstream and do essentially nothing in
return.

And we'd end up with an unmaintainable disaster of a gpu, or well,
accelerator subsystem, because there's nothing you can change or
improve since all the really useful bits of the stack are closed.
And ofc that's not any company's problem anymore, so ofc you with the
habanalabs hat on don't care and call this *extreme*.

> btw, you can today see our TPC code at
> https://github.com/HabanaAI/Habana_Custom_Kernel
> There is a link there to the TPC user guide and link to download the
> LLVM compiler.

I got stuck clicking links before finding the source for that llvm
compiler. Can you give me a direct link to the repo with the source
code instead, please?

Thanks, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 14:44:03

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
>
> > If that means AI companies don't want to open our their hw specs
> > enough to allow that, so be it - all you get in that case is
> > offloading the kernel side of the stack for convenience, with zero
> > long term prospects to ever make this into a cross vendor subsystem
> > stack that does something useful.
>
> I don't think this is true at all - nouveau is probably the best
> example.
>
> nouveau reverse engineered a userspace stack for one of these devices.
>
> How much further ahead would they have been by now if they had a
> vendor supported, fully featured, open kernel driver to build the
> userspace upon?

There are actually tons of examples here; most of the arm socs have
fully open kernel drivers, supported by the vendor (out of tree).

The hard part is the userspace driver and all the things you're
submitting to it. We've had open kernel drivers for mali/qualcomm/...
years before any believable open implementation started existing.
Typing up the memory manager and hw submission queue handling is
comparatively trivial. Generally the kernel driver is also done last;
you bring up the userspace first, often by just directly programming
the hw from userspace. The kernel driver only gets in the way with this
stuff (nouveau is entirely developed as a userspace driver, as the
most extreme example).

This is a bit different for the display side, but nowadays those
drivers are fully in-kernel so they're all open. Well except the
nvidia one, and I've not heard of nvidia working on even an
out-of-tree open display driver, so that won't help the in-tree effort
at all.

Where it would have helped is if this open driver would come with
redistributable firmware, because that is right now the thing making
nouveau reverse-engineering painful enough to be non-feasible. Well
not the reverse-engineering, but the "shipping the result as a working
driver stack".

I don't think the facts on the ground support your claim here, aside
from the practical problem that nvidia is unwilling to even create an
open driver to begin with. So there isn't anything to merge.

> > open up your hw enough for that, I really don't see the point in
> > merging such a driver, it'll be an unmaintainable stack by anyone else
> > who's not having access to those NDA covered specs and patents and
> > everything.
>
> My perspective from RDMA is that the drivers are black boxes. I can
> hack around the interface layers but there is a lot of wild stuff in
> there that can't be understood without access to the HW documentation.

There are shipping gpu drivers with entirely reverse-engineered stacks.
And I don't mean "shipping in fedora" but "shipping in Chrome tablets
sold by OEM partners of Google". So it's very much possible, even if
the vendor is maximally stubborn about things.

> I think only HW that has open specs, like say NVMe, can really be
> properly community oriented. Otherwise we have to work in a community
> partnership with the vendor.

Well sure that's the ideal case, but most vendors in the accel space
aren't interested in an actual partnership with the wider community.
It's "merge this kernel driver and have no further demands about
anything else". Well, there are some who are on board, but it does
take pretty enormous amounts of coercion.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 14:44:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:

> If that means AI companies don't want to open our their hw specs
> enough to allow that, so be it - all you get in that case is
> offloading the kernel side of the stack for convenience, with zero
> long term prospects to ever make this into a cross vendor subsystem
> stack that does something useful.

I don't think this is true at all - nouveau is probably the best
example.

nouveau reverse engineered a userspace stack for one of these devices.

How much further ahead would they have been by now if they had a
vendor supported, fully featured, open kernel driver to build the
userspace upon?

> open up your hw enough for that, I really don't see the point in
> merging such a driver, it'll be an unmaintainable stack by anyone else
> who's not having access to those NDA covered specs and patents and
> everything.

My perspective from RDMA is that the drivers are black boxes. I can
hack around the interface layers but there is a lot of wild stuff in
there that can't be understood without access to the HW documentation.

I think only HW that has open specs, like say NVMe, can really be
properly community oriented. Otherwise we have to work in a community
partnership with the vendor.

Jason

2021-07-06 14:48:37

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 3:44 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jul 06, 2021 at 02:07:16PM +0200, Daniel Vetter wrote:
>
> > On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
> > cores in their hw, for which you'd need some kind of compiler to make
> > use of the hardware and the interfaces the kernel provides? So not
> > really compareable, but also my understanding is that rdma-core does
> > actually allow you to reasonable use&drive all the hw features and
> > kernel interfaces fully.
>
> The whole HPC stack has speciality compilers of course. OpenMP, PGAS,
> etc. These compilers map onto library primitives that eventually boil
> down into rdma-core calls. Even the HW devices have various
> programmability that are being targetted with compilers now. People
> are making NIC devices with ARM cores/etc - P4 is emerging for some
> packet processing tasks.

Well it depends which compilers we're talking about here, and what
kind of features. Higher-level compilers that break down some fancy
language like OpenMP into what it actually should do on a given piece
of hardware, like a gpu or an rdma-connected cluster or whatever, we
really don't care about. You don't need that to drive the hardware.
Usually that stuff works by breaking some of the code down into cpu
compiler IR (most of this is built on top of LLVM IR nowadays),
interspersed with library calls to the runtime.

Now the thing I care about here is when things don't get compiled down
to cpu code, but to some other IR (SPIR-V is starting to win, but very
often it's still a hacked-up version of LLVM IR), which then in a
hw-specific backend gets compiled down to instructions that run on the
hw. I had no idea that rdma NICs can do that, but it sounds like they
can? I guess maybe some openmpi operations could be done directly on
the rdma chip, but I'm not sure why you'd want a backend compiler here.

Anyway, for anything that works like a gpu accelerator, like 3d accel,
or parallel compute accel (aka gpgpu) or spatial compute accel (aka
NN/AI) or maybe even fpga accel, most of the magic to use the hardware
is in this backend compiler, which translates from an IR into whatever
your accelerator consumes. That's the part we really care about for
modern accelerators, because without it the hardware is de facto
useless. Generally these chips have a full-blown, if special-purpose,
ISA, with register files, spilling, branches, loops and other control
flow (sometimes only execution masks on simpler hw).

> rdma-core can drive all the kernel interfaces with at least an ioctl
> wrapper, and it has a test suite that tries to cover this. It does not
> exercise the full HW capability, programmability, etc of every single
> device.
>
> I actually don't entirely know what everyone has built on top of
> rdma-core, or how I'd try to map it the DRI ideas you are trying to
> explain.
>
> Should we ban all Intel RDMA drivers because they are shipping
> proprietary Intel HPC compilers and proprietary Intel MPI which drives
> their RDMA HW? Or is that OK because there are open analogs for some
> of that stuff? And yes, the open versions are inferior in various
> metrics.
>
> Pragmatically what I want to see is enough RDMA common/open user space
> to understand the uAPI and thus more about how the kernel driver
> works. Forcing everyone into rdma-core has already prevented a number
> of uAPI mistakes in drivers that would have been bad - so at least
> this level really is valuable.
>
> > So we actually want less on dri-devel, because for compute/accel chips
> > we're currently happy with a vendor userspace. It just needs to be
> > functional and complete, and open in its entirety.
>
> In a sense yes: DRI doesn't insist on a single code base to act as the
> kernel interface, but that is actually the thing that has brought the
> most value to RDMA, IMHO.

So in practice we're not that different in DRI wrt userspace - if
there is an established cross-vendor project in the given area, we do
expect the userspace side to be merged there. And nowadays most of the
feature work is done that way, it's just that we don't have a single
project like rdma-core for this. We do still allow per-driver submit
interfaces because hw is just not standardized enough there, the
standards are at a higher level. Which is why it just doesn't make
sense to talk about a kernel driver as something that's useful
stand-alone at all.

> We've certainly had some interesting successes because of this. The
> first submission for AWS's EFA driver proposed to skip the rdma-core
> step, which was rejected. However since EFA has been in that ecosystem
> it has benefited greatly, I think.
>
> However, in another sense no: RDMA hasn't been blocking, say Intel,
> just because they have built proprietary stuff on top of our open
> stack.

Oh we allow this too. We only block the initial submission if the
proprietary stuff is the only thing out there.

> Honestly, I think GPU is approaching this backwards. Wayland should
> have been designed to prevent proprietary userspace stacks.

That's not possible without some serious cans of worms though. Wayland
is a protocol, and you can't forbid people from implementing it.
Otherwise all the compatible open implementations of closed protocols
wouldn't be possible either.

Now the implementation is a different thing, and there a few
compositors have succumbed to market pressure and enabled the nvidia
stack, as a mostly separate piece from supporting the open stack. And
that's largely because nvidia managed to completely kill the open
source r/e effort through firmware licensing and crypto-key based
verified loading, so unless you install the proprietary stack you
actually can't make use of the hardware at all - well display works
without the firmware, but 3d/compute just doesn't. So you just can't
use nvidia hw without accepting their proprietary driver licenses and
all that entails for the latest hardware.

So I'm not clear what you're suggesting here we should do different.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 14:58:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 04:09:25PM +0200, Daniel Vetter wrote:
> Anyway, for anything that works like a gpu accelerator, like 3d accel,
> or parallel compute accel (aka gpgpu) or spatial compute accel (aka
> NN/AI) or maybe even fpga accel most of the magic to use the hardware
> is in this backend compiler, which translates from an IR into whatever
> your accelerator consumes. That's the part we really care about for
> modern accelerators because without that defacto the hardware is
> useless. Generally these chips have full-blown, if special purpose
> ISA, with register files, spilling, branches, loops and other control
> flow (sometimes only execution masks on simpler hw).

I don't know if I see it as clearly as you do - at the end of the day
the user keys in the program in some proprietary (or open!) language
and a whack of proprietary magic transforms it to "make it work".

There are many barriers that prevent someone without the secret
knowledge from duplicating the end result of a working program. An
accelerator ISA is certainly one example, but I wouldn't overly focus
on it as the only blocker.

Like you said below the NVIDIA GPU ISA seems known but the HW is still
not really useful for other reasons.

Habana seems to have gone the other way, the HW is fully useful but we
don't have the ISA transformation and other details.

Both cases seem to have ended up with something useless, and I have a
hard time saying nouveau has more right to be in the kernel tree than
Habana does.

> > Honestly, I think GPU is approaching this backwards. Wayland should
> > have been designed to prevent proprietary userspace stacks.
>
> That's not possible without some serious cans of worms though. Wayland
> is a protocol, and you can't forbid people from implementing it.
> Otherwise all the compatible open implementations of closed protocols
> wouldn't be possible either.

Well, in many ways so is Linux, but nobody would seriously
re-implement Linux just to produce a driver.

> So I'm not clear what you're suggesting here we should do different.

Not enabling proprietary stacks as above would be a good start.

Jason

2021-07-06 15:14:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] habanalabs: add support for dma-buf exporter

On Tue, Jul 06, 2021 at 12:44:49PM +0300, Oded Gabbay wrote:

> > > + /* In case we got a large memory area to export, we need to divide it
> > > + * to smaller areas because each entry in the dmabuf sgt can only
> > > + * describe unsigned int.
> > > + */
> >
> > Huh? This is forming a SGL, it should follow the SGL rules which means
> > you have to fragment based on the dma_get_max_seg_size() of the
> > importer device.
> >
> hmm
> I don't see anyone in drm checking this value (and using it) when
> creating the SGL when exporting dmabuf. (e.g.
> amdgpu_vram_mgr_alloc_sgt)

For dmabuf the only importer is RDMA and it doesn't care, but you
certainly should not introduce a hardwired constant instead of using
the correct function.

Jason
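
For illustration only, a minimal sketch of the segment splitting Jason
is describing: building an exporter's sg_table with each entry capped
at the importer's dma_get_max_seg_size() instead of a hardwired
constant. The helper name is hypothetical, and it assumes 'base' is
already a DMA address valid for the importer (e.g. produced via
dma_map_resource() in the exporter's map_dma_buf callback), so this is
not the actual habanalabs code:

#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Hypothetical helper: split a contiguous, already DMA-mapped region
 * into sg entries no larger than the importer's DMA segment limit.
 */
static struct sg_table *build_sgt_for_importer(struct device *importer,
					       dma_addr_t base, u64 size)
{
	unsigned int max_seg = dma_get_max_seg_size(importer);
	unsigned int nents = DIV_ROUND_UP_ULL(size, max_seg);
	struct scatterlist *sg;
	struct sg_table *sgt;
	int i, rc;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	rc = sg_alloc_table(sgt, nents, GFP_KERNEL);
	if (rc) {
		kfree(sgt);
		return ERR_PTR(rc);
	}

	for_each_sgtable_sg(sgt, sg, i) {
		u64 chunk = min_t(u64, size, max_seg);

		/* dma_address/dma_length are what the importer consumes */
		sg_dma_address(sg) = base;
		sg_dma_len(sg) = chunk;
		base += chunk;
		size -= chunk;
	}

	return sgt;
}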

2021-07-06 15:14:09

by Oded Gabbay

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 4:17 PM Daniel Vetter <[email protected]> wrote:
>
> On Tue, Jul 6, 2021 at 2:46 PM Oded Gabbay <[email protected]> wrote:
> >
> > On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <[email protected]> wrote:
> > >
> > > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > > > Greg, I hope this will be good enough for you to merge this code.
> > > > >
> > > > > So we're officially going to use dri-devel for technical details review
> > > > > and then Greg for merging so we don't have to deal with other merge
> > > > > criteria dri-devel folks have?
> > > > >
> > > > > I don't expect anything less by now, but it does make the original claim
> > > > > that drivers/misc will not step all over accelerators folks a complete
> > > > > farce under the totally-not-a-gpu banner.
> > > > >
> > > > > This essentially means that for any other accelerator stack that doesn't
> > > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > > > >
> > > > > There's quite a lot of these floating around actually (and many do have
> > > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > > to these companies in background chats why habanalabs gets away with their
> > > > > stack and they don't.
> > > >
> > > > FYI, I fully agree with Daniel here. Habanlabs needs to open up their
> > > > runtime if they want to push any additional feature in the kernel.
> > > > The current situation is not sustainable.
> > Well, that's like, your opinion...
> >
> > >
> > > Before anyone replies: The runtime is open, the compiler is still closed.
> > > This has become the new default for accel driver submissions, I think
> > > mostly because all the interesting bits for non-3d accelerators are in the
> > > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > > throw in the runtime as a freebie.
> > >
> > > It's still incomplete, and it's still useless if you want to actually hack
> > > on the driver stack.
> > > -Daniel
> > > --
> > I don't understand what's not sustainable here.
> >
> > There is zero code inside the driver that communicates or interacts
> > with our TPC code (TPC is the Tensor Processing Core).
> > Even submitting works to the TPC is done via a generic queue
> > interface. And that queue IP is common between all our engines
> > (TPC/DMA/NIC). The driver provides all the specs of that queue IP,
> > because the driver's code is handling that queue. But why is the TPC
> > compiler code even relevant here ?
>
> Can I use the hw how it's intended to be used without it?
You can use the h/w with the userspace stack we are providing in our
github repos and on our website.
Part of the userspace stack is open source, part is closed source.
And I'm actively working on opening up more stuff as we go along.

>
> If the answer is no, then essentially what you're doing with your
> upstream driver is getting all the benefits of an upstream driver,
> while upstream gets nothing. We can't use your stack, not as-is. Sure
> we can use the queue, but we can't actually submit anything
> interesting. And I'm pretty sure the point of your hw is to do more
> than submit no-op packets to a queue.
>
> This is all "I want my cake and eat it too" approach to upstreaming,
> and it's totally fine attitude to have, but if you don't see why
> there's maybe an different side to it then I don't get what you're
> arguing. Upstream isn't free lunch for nothing.
>
> Frankly I'm starting to assume you're arguing this all in bad faith
> just because habanalabds doesn't want to actually have an open driver
> stack, so any attack is good, no matter what. Which is also what
> everyone else does who submits their accel driver to upstream, and
> which gets us back to the starting point of this sub-thread of me
> really appreciation how this will improve background discussions going
> forward for everyone.
>
> Like if the requirement for accel drivers truly is that you can submit
> a dummy command to the queues then I have about 5-10 drivers at least
> I could merge instantly. For something like the intel gpu driver it
> would be about 50 lines of code (including all the structure boiler
> plate the ioctls require)in userspace to submit a dummy queue command.
> GPU and accel vendors would really love that, because it would allow
> them to freeload on upstream and do essentially nothing in return.
>
> And we'd end up with an unmaintainable disaster of a gpu or well
> accelerator subsystem because there's nothing you can change or
> improve because all the really useful bits of the stack are closed.
> And ofc that's not any companies problem anymore, so ofc you with the
> habanalabs hat on don't care and call this *extreme*.
>
> > btw, you can today see our TPC code at
> > https://github.com/HabanaAI/Habana_Custom_Kernel
> > There is a link there to the TPC user guide and link to download the
> > LLVM compiler.
>
> I got stuck clicking links before I found the source for that llvm
> compiler. Can you give me a direct link to the repo with sourcecode
> instead please?
The source code for the LLVM compiler is not available yet. That's one
of the parts I'm working on getting into the open.
Having said that, I don't think (and I'm not alone in this) that this
should be a prerequisite for upstreaming kernel drivers of any
type.
And we had this discussion in the past; I'm sure we are both tired of
repeating ourselves.

>
> Thanks, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

2021-07-06 15:24:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 02:07:16PM +0200, Daniel Vetter wrote:

> On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
> cores in their hw, for which you'd need some kind of compiler to make
> use of the hardware and the interfaces the kernel provides? So not
> really compareable, but also my understanding is that rdma-core does
> actually allow you to reasonable use&drive all the hw features and
> kernel interfaces fully.

The whole HPC stack has speciality compilers of course: OpenMP, PGAS,
etc. These compilers map onto library primitives that eventually boil
down into rdma-core calls. Even the HW devices have various
programmable elements that are being targeted with compilers now.
People are making NIC devices with ARM cores/etc - P4 is emerging for
some packet processing tasks.

rdma-core can drive all the kernel interfaces with at least an ioctl
wrapper, and it has a test suite that tries to cover this. It does not
exercise the full HW capability, programmability, etc of every single
device.

I actually don't entirely know what everyone has built on top of
rdma-core, or how I'd try to map it to the DRI ideas you are trying to
explain.

Should we ban all Intel RDMA drivers because they are shipping
proprietary Intel HPC compilers and proprietary Intel MPI which drives
their RDMA HW? Or is that OK because there are open analogs for some
of that stuff? And yes, the open versions are inferior in various
metrics.

Pragmatically what I want to see is enough RDMA common/open user space
to understand the uAPI and thus more about how the kernel driver
works. Forcing everyone into rdma-core has already prevented a number
of uAPI mistakes in drivers that would have been bad - so at least
this level really is valuable.

> So we actually want less on dri-devel, because for compute/accel chips
> we're currently happy with a vendor userspace. It just needs to be
> functional and complete, and open in its entirety.

In a sense yes: DRI doesn't insist on a single code base to act as the
kernel interface, but that is actually the thing that has brought the
most value to RDMA, IMHO.

We've certainly had some interesting successes because of this. The
first submission for AWS's EFA driver proposed to skip the rdma-core
step, which was rejected. However since EFA has been in that ecosystem
it has benefited greatly, I think.

However, in another sense no: RDMA hasn't been blocking, say Intel,
just because they have built proprietary stuff on top of our open
stack.

Honestly, I think GPU is approaching this backwards. Wayland should
have been designed to prevent proprietary userspace stacks.

Jason

2021-07-06 15:26:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 04:39:19PM +0200, Daniel Vetter wrote:
> On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> >
> > > If that means AI companies don't want to open our their hw specs
> > > enough to allow that, so be it - all you get in that case is
> > > offloading the kernel side of the stack for convenience, with zero
> > > long term prospects to ever make this into a cross vendor subsystem
> > > stack that does something useful.
> >
> > I don't think this is true at all - nouveau is probably the best
> > example.
> >
> > nouveau reverse engineered a userspace stack for one of these devices.
> >
> > How much further ahead would they have been by now if they had a
> > vendor supported, fully featured, open kernel driver to build the
> > userspace upon?
>
> There is actually tons of example here, most of the arm socs have
> fully open kernel drivers, supported by the vendor (out of tree).

I chose nouveau because of this:

$ git ls-files drivers/gpu/drm/arm/ | xargs wc -l
15039 total
$ git ls-files drivers/gpu/drm/nouveau/ | xargs wc -l
204198 total

At 13x the size of mali this is not just some easy-to-wire-up memory
manager and command submission. And after all that typing it still
isn't very good. The fully supported AMD vendor driver is over 3
million lines, so nouveau probably needs to grow several times over.

My argument is that an in-tree open kernel driver is a big help to
reverse-engineering an open userspace. Having the vendor's
collaboration to build that monstrous thing can only help the end goal
of an end-to-end open stack.

For instance a vendor with an in-tree driver has a strong incentive to
sort out their FW licensing issues so it can be redistributed.

I'm not sure about this all-or-nothing approach. AFAIK DRM has the
worst problems with out-of-tree drivers right now.

> Where it would have helped is if this open driver would come with
> redistributable firmware, because that is right now the thing making
> nouveau reverse-engineering painful enough to be non-feasible. Well
> not the reverse-engineering, but the "shipping the result as a working
> driver stack".

I don't think much of the out of tree but open drivers. The goal must
be to get vendors in tree.

I would applaud Habana for getting an in-tree driver at least, even if
the userspace is not what we'd all want to see.

> I don't think the facts on the ground support your claim here, aside
> from the practical problem that nvidia is unwilling to even create an
> open driver to begin with. So there isn't anything to merge.

The internet tells me there is nvgpu, it doesn't seem to have helped.

Jason

2021-07-06 15:50:02

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 5:25 PM Jason Gunthorpe <[email protected]> wrote:
> On Tue, Jul 06, 2021 at 04:39:19PM +0200, Daniel Vetter wrote:
> > On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> > >
> > > > If that means AI companies don't want to open our their hw specs
> > > > enough to allow that, so be it - all you get in that case is
> > > > offloading the kernel side of the stack for convenience, with zero
> > > > long term prospects to ever make this into a cross vendor subsystem
> > > > stack that does something useful.
> > >
> > > I don't think this is true at all - nouveau is probably the best
> > > example.
> > >
> > > nouveau reverse engineered a userspace stack for one of these devices.
> > >
> > > How much further ahead would they have been by now if they had a
> > > vendor supported, fully featured, open kernel driver to build the
> > > userspace upon?
> >
> > There is actually tons of example here, most of the arm socs have
> > fully open kernel drivers, supported by the vendor (out of tree).
>
> I choose nouveau because of this:
>
> $ git ls-files drivers/gpu/drm/arm/ | xargs wc -l
> 15039 total
> $ git ls-files drivers/gpu/drm/nouveau/ | xargs wc -l
> 204198 total

drm/arm is the arm display driver, which isn't actually shipping
anywhere afaik. Also it doesn't include the hdmi/dp output drivers;
those are generally external on socs, but integrated in discrete gpus.

The other thing to keep in mind is that one of these drivers supports
25 years of product generations, and the other one doesn't. So I
think, adding it all up, it's not that much different. Last time I
looked, if you consider just command submission and rendering/compute,
and don't include display (which heavily skews the stats), it's about
10% kernel, 90% userspace driver parts. That's not including anything
that's shared, which is most of it (compiler frontend, intermediate
optimizer, the entire runtime/state tracker, and largely all the
integration and glue pieces).

> At 13x the size of mali this is not just some easy to wire up memory
> manager and command submission. And after all that typing it still
> isn't very good. The fully supported AMD vendor driver is over 3
> million lines, so nouveau probably needs to grow several times.

AMD is 3 million lines in size because it includes per-generation
generated header files.

And of course once you throw an entire vendor team at a driver all
those engineers will produce something, and there's the usual problem
that the last 10% of features produce about 90% of the complexity and
code. E.g. the kbase driver for the arm mali gpu is 20x the size of
the in-tree panfrost driver - they need to keep typing to justify
their continued employment, or something like that. Usually it's
because they reinvent the world.

> My argument is that an in-tree open kernel driver is a big help to
> reverse engineering an open userspace. Having the vendors
> collaboration to build that monstrous thing can only help the end goal
> of an end to end open stack.

Not sure where this got lost, but we're totally fine with vendors
using the upstream driver together with their closed stack. And most
of the drivers we do have in upstream are actually, at least in parts,
supported by the vendor. E.g. if you'd have looked, the drm/arm driver
you picked is actually 100% written by ARM engineers. So it's kind of
an unfitting example.

> For instance a vendor with an in-tree driver has a strong incentive to
> sort out their FW licensing issues so it can be redistributed.

Nvidia has been claiming to try and sort out the FW problem for years.
They even managed to release a few things, but I think the last one is
2-3 years late now. Part of the reason is that they don't have a
stable api between the firmware and driver; it's all internal from the
same source tree, and they don't really want to change that.

> I'm not sure about this all or nothing approach. AFAIK DRM has the
> worst problems with out of tree drivers right now.

Well I guess someone could stand up a drivers/totally-not-gpu and just
let the flood in. Even duplicated drivers and everything included,
because the vendor drivers are better. Worth a shot, we've practically
started this already, I'm just not going to help with the cleanup.

> > Where it would have helped is if this open driver would come with
> > redistributable firmware, because that is right now the thing making
> > nouveau reverse-engineering painful enough to be non-feasible. Well
> > not the reverse-engineering, but the "shipping the result as a working
> > driver stack".
>
> I don't think much of the out of tree but open drivers. The goal must
> be to get vendors in tree.

Agreed. We actually got them in-tree largely. Nvidia even contributes
the oddball thing, and I think the tegra line is still fully supported
in upstream with the upstream driver.

I'm not sure the bleak picture you're drawing is reality, aside from
the fact that Nvidia discrete gpu drivers are a disaster with no
redistributable firmware, no open kernel driver that works, and
nothing else really either.

> I would applaud Habana for getting an intree driver at least, even if
> the userspace is not what we'd all want to see.
>
> > I don't think the facts on the ground support your claim here, aside
> > from the practical problem that nvidia is unwilling to even create an
> > open driver to begin with. So there isn't anything to merge.
>
> The internet tells me there is nvgpu, it doesn't seem to have helped.

Not sure which one you mean, but every once in a while they open up a
few headers, or a few programming specs, or a small driver somewhere
for a very specific thing, and then it dies again, or gets obfuscated
for the next platform, or just never gets updated. I've never seen
anything that comes remotely close to something complete, aside from
tegra socs, which are fully supported in upstream afaik.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 15:54:40

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 4:56 PM Jason Gunthorpe <[email protected]> wrote:
> On Tue, Jul 06, 2021 at 04:09:25PM +0200, Daniel Vetter wrote:
> > Anyway, for anything that works like a gpu accelerator, like 3d accel,
> > or parallel compute accel (aka gpgpu) or spatial compute accel (aka
> > NN/AI) or maybe even fpga accel most of the magic to use the hardware
> > is in this backend compiler, which translates from an IR into whatever
> > your accelerator consumes. That's the part we really care about for
> > modern accelerators because without that defacto the hardware is
> > useless. Generally these chips have full-blown, if special purpose
> > ISA, with register files, spilling, branches, loops and other control
> > flow (sometimes only execution masks on simpler hw).
>
> I don't know if I see it so clearly as you do - at the end of the day
> the user keys in the program in some proprietary (or open!) language
> and and wack of propritary magic transforms it to "make it work".
>
> There are many barriers that prevent someone without the secret
> knowledge from duplicating the end result of a working program. An
> accelerator ISA is certainly one example, but I wouldn't overly focus
> on it as the only blocker.

Well we don't; we just ask for the full driver stack to make the hw
work. It's just that in the past most vendors chose to leave out the
compiler/ISA from their open stack/specs. Well, except nvidia, which
still chooses to leave out everything aside from some very, very
minimal things around documenting display functionality.

> Like you said below the NVIDIA GPU ISA seems known but the HW is still
> not really useful for other reasons.
>
> Habana seems to have gone the other way, the HW is fully useful but we
> don't have the ISA transformation and other details.

You can actually use nvidia gpus, they're fully functional.

If you install the blobby stack. Which is exactly the same thing as
with habanalabs, plus/minus a few things at the fringes.

In the end it's about drawing the line somewhere, so maybe we should
merge the nvidia glue code that makes their blobby stack work better
with upstream? There are quite a few pieces there, e.g. their display
driver is by design a userspace driver, whereas with kernel
modesetting it needs to be in the kernel to expose the common kms
ioctl interfaces, so they've built up a glue layer to forward
everything to userspace and back. On windows that works because kernel
code there can have growing stacks and fun stuff like that, at least
that's my understanding. It's not really an option to just run that
code in linux.

I'm pretty sure nvidia would appreciate that, and maybe every once in
a while they open up a header for a generation or two of products like
they've done in the past.

> Both cases seem to have ended up with something useless, and I have a
> hard time saying nouveau has more right to be in the kernel tree than
> Habana does.
>
> > > Honestly, I think GPU is approaching this backwards. Wayland should
> > > have been designed to prevent proprietary userspace stacks.
> >
> > That's not possible without some serious cans of worms though. Wayland
> > is a protocol, and you can't forbid people from implementing it.
> > Otherwise all the compatible open implementations of closed protocols
> > wouldn't be possible either.
>
> Well, in many ways so is Linux, but nobody would seriously
> re-implement Linux just to produce a driver.

Well in the gpu space for 2+ decades nvidia has been setting the
standard, and the open stack has been trying to catch up by
reimplementing the entire thing. It took a fair while.

> > So I'm not clear what you're suggesting here we should do different.
>
> Not enabling proprietary stacks as above would be a good start.

I'm still not sure what exactly you mean here. Like on the 3d side
there's opengl and vulkan, and nvidia just has an entirely different
implementation of that compared to any of the open drivers. That is a
bit less code than linux, but it's not small, and reimplementing over
decades is pretty much what happened. And if it's not allowed we'd
actually not have an open 3d gpu stack at all, because only very
recently did we get an agreement around the tracemark/licensing issues
of that stuff with Khronos. Recently compared to the history of opengl
at least.

So I'm still not clear what exactly it is you're suggesting we should
do? Not implement the industry standards for 3d (and accept we stay
irrelevant forever)? Reject nvidia blobs harder than we do already?
At least some distros will continue to ship an auto-installer for that
stack, so we're pretty much maxed out already. Like in what way
do you think the upstream stack does enable the proprietary nvidia
stack? Should we permanently ban any contributions from anyone with an
@nvidia.com address, even if it helps the open stack improve?

Like I'm not seeing something concrete that could be done which would
actually prevent nvidia from having their completely independent
stack, with the exact same functionality and not a line of code shared.
Which is where we are right now. The only way we could be more
strict is to reject any contributions from them at all, just because
we don't like them. That seems a bit too extreme.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 16:09:00

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 5:49 PM Daniel Vetter <[email protected]> wrote:
> On Tue, Jul 6, 2021 at 5:25 PM Jason Gunthorpe <[email protected]> wrote:
> > I'm not sure about this all or nothing approach. AFAIK DRM has the
> > worst problems with out of tree drivers right now.
>
> Well I guess someone could stand up a drivers/totally-not-gpu and just
> let the flood in. Even duplicated drivers and everything included,
> because the vendor drivers are better. Worth a shot, we've practically
> started this already, I'm just not going to help with the cleanup.

tbh I think at this point someone should just do that. Ideally with
some boundary like please don't use dma-fence or dma-buf and stuff
like that so drivers/gpu doesn't ever have to deal with the fallout.
But way too many people think that somehow you magically get the other
90% of an open accel stack if you're just friendly enough and merge
the kernel driver, so we really should just run that experiment in
upstream and watch it pan out in reality.

Minimally it would be some great entertainment :-)

Also on your claim that drivers/gpu is a non-upstream disaster: I've
also learned that for drivers/rdma there's the upstream driver,
and then there's the out-of-tree hackjob the vendor actually supports.
So it seems to be about the same level of screwed up: if you ask the
vendor, they tell you the upstream driver isn't a thing they care about
and it's just done for a bit of goodwill. Except if you have enormous
amounts of volume, then suddenly it's an option ... Minus the fw issue
for nvidia, upstream does support all the gpus you can buy right now
and that can run on linux with some vendor driver (aka excluding apple
M1 and ofc upcoming products from most vendors).

drivers/accel otoh is mostly out-of-tree, because aside from Greg
merging habanalabs no one is bold enough anymore to just merge them
all. There are lots of those going around that would be ready for
picking. And they've been continuously submitted to upstream over the
years, even before the entire habanalabs thing.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 16:32:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:

> The other thing to keep in mind is that one of these drivers supports
> 25 years of product generations, and the other one doesn't.

Sure, but that is the point, isn't it? To have an actually useful
thing you need all of this mess.

> > My argument is that an in-tree open kernel driver is a big help to
> > reverse engineering an open userspace. Having the vendors
> > collaboration to build that monstrous thing can only help the end goal
> > of an end to end open stack.
>
> Not sure where this got lost, but we're totally fine with vendors
> using the upstream driver together with their closed stack. And most
> of the drivers we do have in upstream are actually, at least in parts,
> supported by the vendor. E.g. if you'd have looked the drm/arm driver
> you picked is actually 100% written by ARM engineers. So kinda
> unfitting example.

So the argument with Habana really boils down to: how much do they need
to show in the open source space to get a kernel driver? You want to
see the ISA or the compiler at least?

That at least doesn't seem "extreme" to me.

> > For instance a vendor with an in-tree driver has a strong incentive to
> > sort out their FW licensing issues so it can be redistributed.
>
> Nvidia has been claiming to try and sort out the FW problem for years.
> They even managed to release a few things, but I think the last one is
> 2-3 years late now. Partially the reason is that there don't have a
> stable api between the firmware and driver, it's all internal from the
> same source tree, and they don't really want to change that.

Right, companies have no incentive to work in a sane way if they have
their own parallel world. I think drawing them part by part into the
standard open workflows and expectations is actually helpful to
everyone.

> > > I don't think the facts on the ground support your claim here, aside
> > > from the practical problem that nvidia is unwilling to even create an
> > > open driver to begin with. So there isn't anything to merge.
> >
> > The internet tells me there is nvgpu, it doesn't seem to have helped.
>
> Not sure which one you mean, but every once in a while they open up a
> few headers, or a few programming specs, or a small driver somewhere
> for a very specific thing, and then it dies again or gets obfuscated
> for the next platform, or just never updated. I've never seen anything
> that comes remotely to something complete, aside from tegra socs,
> which are fully supported in upstream afaik.

I understand nvgpu is the tegra driver that people actually
use. nouveau may have good tegra support, but is it used in any actual
commercial product?

Jason

2021-07-06 17:29:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 06:07:17PM +0200, Daniel Vetter wrote:

> Also on your claim that drivers/gpu is a non-upstream disaster: I've
> also learned that that for drivers/rdma there's the upstream driver,
> and then there's the out-of-tree hackjob the vendor actually
> supports.

In the enterprise world everyone has their out-of-tree backport
drivers. It varies by vendor how much deviation there is from the
upstream driver and what commercial support relationship the vendor
has with the enterprise distros.

> So seems to be about the same level of screwed up, if you ask the
> vendor they tell you the upstream driver isn't a thing they care about
> and it's just done for a bit of goodwill.

Sounds like you should get a new RDMA supplier :)

To be fair, Intel is getting better: they got their new RDMA HW support
merged into v5.14 after about 2 years in the out-of-tree world. Though
it is still incomplete compared to their out-of-tree driver, the gap
is much smaller now.

> amounts of volume, then suddenly it's an option ... Minus the fw issue
> for nvidia, upstream does support all the gpus you can buy right now
> and that can run on linux with some vendor driver (aka excluding apple
> M1 and ofc upcoming products from most vendors).

I would look at how many actual commercial systems are running the
upstream/inbox stack. I personally know of quite a few sites with big
HPC RDMA deployments running pure inbox kernels, no add-on kernel
modules, with full commercial support.

If you can say that kind of arrangement is also commonplace in the GPU
world then I will happily be wrong.

Jason

2021-07-06 17:32:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 02:28:28PM -0300, Jason Gunthorpe wrote:
> > Also on your claim that drivers/gpu is a non-upstream disaster: I've
> > also learned that that for drivers/rdma there's the upstream driver,
> > and then there's the out-of-tree hackjob the vendor actually
> > supports.
>
> In the enterprise world everyone has their out of tree backport
> drivers. It varies on the vendor how much deviation there is from the
> upstream driver and what commercial support relationship the vendor
> has with the enterprise distros.

I think he means the Mellanox OFED stack, which is a complete and utter
mess and which gets force-fed by Mellanox/Nvidia to unsuspecting
customers. I know many big HPC sites that ignore it, but a lot of
enterprise customers are dumb enough to deploy it.

2021-07-06 17:38:17

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
>
> > The other thing to keep in mind is that one of these drivers supports
> > 25 years of product generations, and the other one doesn't.
>
> Sure, but that is the point, isn't it? To have an actually useful
> thing you need all of this mess
>
> > > My argument is that an in-tree open kernel driver is a big help to
> > > reverse engineering an open userspace. Having the vendors
> > > collaboration to build that monstrous thing can only help the end goal
> > > of an end to end open stack.
> >
> > Not sure where this got lost, but we're totally fine with vendors
> > using the upstream driver together with their closed stack. And most
> > of the drivers we do have in upstream are actually, at least in parts,
> > supported by the vendor. E.g. if you'd have looked the drm/arm driver
> > you picked is actually 100% written by ARM engineers. So kinda
> > unfitting example.
>
> So the argument with Habana really boils down to how much do they need
> to show in the open source space to get a kernel driver? You want to
> see the ISA or compiler at least?

Yup. We don't care about any of the fancy pieces you build on top, nor
does the compiler need to be the optimizing one. Just something that's
good enough to drive the hw in some demons to see how it works and all
that. Generally that's also not that hard to reverse engineer, if
someone is bored enough, the real fancy stuff tends to be in how you
optimize the generated code. And make it fit into the higher levels
properly.

> That at least doesn't seem "extreme" to me.
>
> > > For instance a vendor with an in-tree driver has a strong incentive to
> > > sort out their FW licensing issues so it can be redistributed.
> >
> > Nvidia has been claiming to try and sort out the FW problem for years.
> > They even managed to release a few things, but I think the last one is
> > 2-3 years late now. Partially the reason is that there don't have a
> > stable api between the firmware and driver, it's all internal from the
> > same source tree, and they don't really want to change that.
>
> Right, companies have no incentive to work in a sane way if they have
> their own parallel world. I think drawing them part by part into the
> standard open workflows and expectations is actually helpful to
> everyone.

Well we do try to get them on board part-by-part generally starting
with the kernel and ending with a proper compiler instead of the usual
llvm hack job, but for whatever reasons they really like their
in-house stuff, see below for what I mean.

> > > > I don't think the facts on the ground support your claim here, aside
> > > > from the practical problem that nvidia is unwilling to even create an
> > > > open driver to begin with. So there isn't anything to merge.
> > >
> > > The internet tells me there is nvgpu, it doesn't seem to have helped.
> >
> > Not sure which one you mean, but every once in a while they open up a
> > few headers, or a few programming specs, or a small driver somewhere
> > for a very specific thing, and then it dies again or gets obfuscated
> > for the next platform, or just never updated. I've never seen anything
> > that comes remotely to something complete, aside from tegra socs,
> > which are fully supported in upstream afaik.
>
> I understand nvgpu is the tegra driver that people actualy
> use. nouveau may have good tegra support but is it used in any actual
> commercial product?

I think it was almost the case. Afaik they still have their internal
userspace stack working on top of nvidia, at least last year someone
fixed up a bunch of issues in the tegra+nouveau combo to enable format
modifiers properly across the board. But also nvidia is never going to
sell you that as the officially supported thing, unless your ask comes
back with enormous amounts of sold hardware.

And it's not just nvidia, it's pretty much everyone. Like a soc
company I don't want to know started collaborating with upstream and
the reverse-engineered mesa team on a kernel driver, seems to work
pretty well for current hardware. But for the next generation they
decided it's going to be again only their in-house tree that
completely ignores drivers/gpu/drm, and also tosses all the
foundational work they helped build on the userspace side. And this is
consistent across all companies: over the last 20 years I know of
(often non-public) stories from every single company where they
decided that all the time invested into community/upstream
collaboration isn't useful anymore, and they go all vendor solo for the
next one.

Most of those you luckily don't hear about anymore; all it results in is
the upstream driver being 1-2 years late or so. But even the good ones
where we collaborate well can't seem to help themselves and want to
throw it all away every few years.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 18:00:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 07:31:37PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 06, 2021 at 02:28:28PM -0300, Jason Gunthorpe wrote:
> > > Also on your claim that drivers/gpu is a non-upstream disaster: I've
> > > also learned that that for drivers/rdma there's the upstream driver,
> > > and then there's the out-of-tree hackjob the vendor actually
> > > supports.
> >
> > In the enterprise world everyone has their out of tree backport
> > drivers. It varies on the vendor how much deviation there is from the
> > upstream driver and what commercial support relationship the vendor
> > has with the enterprise distros.
>
> I think he means the Mellanox OFED stack, which is a complete and utter
> mess and which gets force fed by Mellanox/Nvidia on unsuspecting
> customers. I know many big HPC sites that ignore it, but a lot of
> enterprise customers are dumb enought to deploy it.

No, I don't think so. While MOFED is indeed a giant mess, the mlx5
upstream driver is not some token effort to generate goodwill, and
Mellanox certainly does provide full commercial support for the mlx5
drivers shipped inside various enterprise distros.

MOFED also doesn't have a big functional divergence from RDMA
upstream, and it is not mandatory just to use the hardware.

I cannot say the same about other companies' RDMA driver
distributions; Daniel's description of "minimal effort to get
goodwill" would match others much better.

You are right that there are a lot of enterprise customers who deploy
MOFED. I can't agree with their choices, but they are not forced
into using it anymore.

Jason

2021-07-06 18:05:20

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

I should stop typing and prep dinner, but I found some too hilarious
typos below.

On Tue, Jul 6, 2021 at 7:35 PM Daniel Vetter <[email protected]> wrote:
>
> On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
> >
> > > The other thing to keep in mind is that one of these drivers supports
> > > 25 years of product generations, and the other one doesn't.
> >
> > Sure, but that is the point, isn't it? To have an actually useful
> > thing you need all of this mess
> >
> > > > My argument is that an in-tree open kernel driver is a big help to
> > > > reverse engineering an open userspace. Having the vendors
> > > > collaboration to build that monstrous thing can only help the end goal
> > > > of an end to end open stack.
> > >
> > > Not sure where this got lost, but we're totally fine with vendors
> > > using the upstream driver together with their closed stack. And most
> > > of the drivers we do have in upstream are actually, at least in parts,
> > > supported by the vendor. E.g. if you'd have looked the drm/arm driver
> > > you picked is actually 100% written by ARM engineers. So kinda
> > > unfitting example.
> >
> > So the argument with Habana really boils down to how much do they need
> > to show in the open source space to get a kernel driver? You want to
> > see the ISA or compiler at least?
>
> Yup. We dont care about any of the fancy pieces you build on top, nor
> does the compiler need to be the optimizing one. Just something that's
> good enough to drive the hw in some demons to see how it works and all

s/demons/demos/ but hw tends to be funky enough that either fits :-)

> that. Generally that's also not that hard to reverse engineer, if
> someone is bored enough, the real fancy stuff tends to be in how you
> optimize the generated code. And make it fit into the higher levels
> properly.
>
> > That at least doesn't seem "extreme" to me.
> >
> > > > For instance a vendor with an in-tree driver has a strong incentive to
> > > > sort out their FW licensing issues so it can be redistributed.
> > >
> > > Nvidia has been claiming to try and sort out the FW problem for years.
> > > They even managed to release a few things, but I think the last one is
> > > 2-3 years late now. Partially the reason is that there don't have a
> > > stable api between the firmware and driver, it's all internal from the
> > > same source tree, and they don't really want to change that.
> >
> > Right, companies have no incentive to work in a sane way if they have
> > their own parallel world. I think drawing them part by part into the
> > standard open workflows and expectations is actually helpful to
> > everyone.
>
> Well we do try to get them on board part-by-part generally starting
> with the kernel and ending with a proper compiler instead of the usual
> llvm hack job, but for whatever reasons they really like their
> in-house stuff, see below for what I mean.
>
> > > > > I don't think the facts on the ground support your claim here, aside
> > > > > from the practical problem that nvidia is unwilling to even create an
> > > > > open driver to begin with. So there isn't anything to merge.
> > > >
> > > > The internet tells me there is nvgpu, it doesn't seem to have helped.
> > >
> > > Not sure which one you mean, but every once in a while they open up a
> > > few headers, or a few programming specs, or a small driver somewhere
> > > for a very specific thing, and then it dies again or gets obfuscated
> > > for the next platform, or just never updated. I've never seen anything
> > > that comes remotely to something complete, aside from tegra socs,
> > > which are fully supported in upstream afaik.
> >
> > I understand nvgpu is the tegra driver that people actualy
> > use. nouveau may have good tegra support but is it used in any actual
> > commercial product?
>
> I think it was almost the case. Afaik they still have their internal
> userspace stack working on top of nvidia, at least last year someone
> fixed up a bunch of issues in the tegra+nouveau combo to enable format
> modifiers properly across the board. But also nvidia is never going to
> sell you that as the officially supported thing, unless your ask comes
> back with enormous amounts of sold hardware.
>
> And it's not just nvidia, it's pretty much everyone. Like a soc
> company I don't want to know started collaborating with upstream and

s/know/name/ I do know them unfortunately quite well ...

Cheers, Daniel

> the reverse-engineered mesa team on a kernel driver, seems to work
> pretty well for current hardware. But for the next generation they
> decided it's going to be again only their in-house tree that
> completele ignores drivers/gpu/drm, and also tosses all the
> foundational work they helped build on the userspace side. And this is
> consistent across all companies, over the last 20 years I know of
> (often non-public) stories across every single company where they
> decided that all the time invested into community/upstream
> collaboration isn't useful anymore, we go all vendor solo for the next
> one.
>
> Most of those you luckily don't hear about anymore, all it results in
> the upstream driver being 1-2 years late or so. But even the good ones
> where we collaborate well can't seem to help themselves and want to
> throw it all away every few years.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-06 18:33:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:

> Yup. We dont care about any of the fancy pieces you build on top, nor
> does the compiler need to be the optimizing one. Just something that's
> good enough to drive the hw in some demons to see how it works and all
> that. Generally that's also not that hard to reverse engineer, if
> someone is bored enough, the real fancy stuff tends to be in how you
> optimize the generated code. And make it fit into the higher levels
> properly.

Seems reasonable to me

> And it's not just nvidia, it's pretty much everyone. Like a soc
> company I don't want to know started collaborating with upstream and
> the reverse-engineered mesa team on a kernel driver, seems to work
> pretty well for current hardware.

What I've seen is that this only works with customer demand. Companies
need to hear from their customers that upstream is what is needed, and
companies cannot properly hear that until they are at least already
partially invested in the upstream process and have the right
customers that are sophisticated enough to care.

Embedded makes everything 10x worse because too many customers just
don't care about upstream, you can hack your way through everything,
and indulge in single generation thinking. Fork the whole kernel for 3
years, EOL, no problem!

It is the enterprise world, particularly with an opinionated company
like RH stuck in the middle saying NO, that really seems to drive
things toward upstream.

Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
example), but it is incredibly time-consuming, expensive and becoming
more and more difficult every year.

The big point is this:

> But also nvidia is never going to sell you that as the officially
> supported thing, unless your ask comes back with enormous amounts of
> sold hardware.

I think this is at the core of Linux's success in the enterprise
world: big customers who care demanding open source. Any vendor, even
nvidia, will want to meet customer demands.

IMHO upstream success is found by motivating the customer to demand it
and making it "easy" for the vendor to supply it.

Jason

2021-07-06 19:11:04

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 2:31 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:
>
> > Yup. We dont care about any of the fancy pieces you build on top, nor
> > does the compiler need to be the optimizing one. Just something that's
> > good enough to drive the hw in some demons to see how it works and all
> > that. Generally that's also not that hard to reverse engineer, if
> > someone is bored enough, the real fancy stuff tends to be in how you
> > optimize the generated code. And make it fit into the higher levels
> > properly.
>
> Seems reasonable to me
>
> > And it's not just nvidia, it's pretty much everyone. Like a soc
> > company I don't want to know started collaborating with upstream and
> > the reverse-engineered mesa team on a kernel driver, seems to work
> > pretty well for current hardware.
>
> What I've seen is that this only works with customer demand. Companies
> need to hear from their customers that upstream is what is needed, and
> companies cannot properly hear that until they are at least already
> partially invested in the upstream process and have the right
> customers that are sophisticated enough to care.
>
> Embedded makes everything 10x worse because too many customers just
> don't care about upstream, you can hack your way through everything,
> and indulge in single generation thinking. Fork the whole kernel for 3
> years, EOL, no problem!
>
> It is the enterprise world, particularly with an opinionated company
> like RH saying NO stuck in the middle that really seems to drive
> things toward upstream.
>
> Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
> example) but it is incredibly time consuming, expensive and becoming
> more and more difficult every year.
>
> The big point is this:
>
> > But also nvidia is never going to sell you that as the officially
> > supported thing, unless your ask comes back with enormous amounts of
> > sold hardware.
>
> I think this is at the core of Linux's success in the enterprise
> world. Big customers who care demanding open source. Any vendor, even
> nvidia will want to meet customer demands.
>
> IHMO upstream success is found by motivating the customer to demand
> and make it "easy" for the vendor to supply it.

I think this is one of the last big challenges on Linux. It's REALLY
hard to align new products with Linux kernel releases and distro
kernels. Hardware cycles are too short and drivers (at least for
GPUs) are too big to really fit well with the current Linux release
model. In many cases enterprise distros have locked down on a kernel
version around the same time we are doing new chip bring-up. You are
almost always off by one when it comes to kernel version alignment.
Even if you can get the initial code upstream in the right kernel
version, it tends to be aligned to such early silicon that you end up
needing a pile of additional patches to make production cards work.
Those changes are often deemed "too big" for stable kernel fixes. The
only real way to deal with that effectively is with vendor-provided
packaged drivers using something like dkms to cover launch. Thus you
need to do your bring-up against the latest upstream and then backport,
or do your bring-up against some older kernel and forward-port for
upstream. You end up doing everything twice. Things get better with
sustaining support in subsequent distro releases, but it doesn't help
at product launch. I don't know what the right solution for this is.

Alex
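
For readers who haven't run into it, the "packaged drivers using something
like dkms" model Alex describes boils down to shipping the out-of-tree module
source plus a small dkms.conf, so the module gets rebuilt automatically for
whatever distro kernel is installed. A rough sketch, with a placeholder
package name, version and paths rather than any vendor's real package:

# /usr/src/example-accel-1.0/dkms.conf (all names here are placeholders)
PACKAGE_NAME="example-accel"
PACKAGE_VERSION="1.0"
BUILT_MODULE_NAME[0]="example_accel"
DEST_MODULE_LOCATION[0]="/updates/dkms"
AUTOINSTALL="yes"
# Real packages typically also set MAKE[0]/CLEAN to override the default
# kernel-module build command and usually list more than one module.

# Register, build and install against the installed distro kernel:
#   dkms add -m example-accel -v 1.0
#   dkms build -m example-accel -v 1.0
#   dkms install -m example-accel -v 1.0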

2021-07-06 19:11:14

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Tue, Jul 6, 2021 at 8:31 PM Jason Gunthorpe <[email protected]> wrote:
> On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:
> > Yup. We dont care about any of the fancy pieces you build on top, nor
> > does the compiler need to be the optimizing one. Just something that's
> > good enough to drive the hw in some demons to see how it works and all
> > that. Generally that's also not that hard to reverse engineer, if
> > someone is bored enough, the real fancy stuff tends to be in how you
> > optimize the generated code. And make it fit into the higher levels
> > properly.
>
> Seems reasonable to me
>
> > And it's not just nvidia, it's pretty much everyone. Like a soc
> > company I don't want to know started collaborating with upstream and
> > the reverse-engineered mesa team on a kernel driver, seems to work
> > pretty well for current hardware.
>
> What I've seen is that this only works with customer demand. Companies
> need to hear from their customers that upstream is what is needed, and
> companies cannot properly hear that until they are at least already
> partially invested in the upstream process and have the right
> customers that are sophisticated enough to care.
>
> Embedded makes everything 10x worse because too many customers just
> don't care about upstream, you can hack your way through everything,
> and indulge in single generation thinking. Fork the whole kernel for 3
> years, EOL, no problem!

It's not entirely hopeless in embedded either. Sure there's the giant
pile of sell&forget abandonware, but there are lots of embedded things
where multi-year to multi-decade support is required. And an upstream
gfx stack beats anything the vendor has to offer on that, easily.

And on the server side it's actually pretty hard to convince customers
of the upstream driver benefits, because they don't want to or can't
abandon nvidia and have just learned to accept the pain. They either
build a few abstraction layers on top (and demand the vendor support
those), or they flat out demand you support the nvidia proprietary
interfaces. And AMD has been trying to move the needle here for years,
with not that much success.

> It is the enterprise world, particularly with an opinionated company
> like RH saying NO stuck in the middle that really seems to drive
> things toward upstream.
>
> Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
> example) but it is incredibly time consuming, expensive and becoming
> more and more difficult every year.
>
> The big point is this:
>
> > But also nvidia is never going to sell you that as the officially
> > supported thing, unless your ask comes back with enormous amounts of
> > sold hardware.
>
> I think this is at the core of Linux's success in the enterprise
> world. Big customers who care demanding open source. Any vendor, even
> nvidia will want to meet customer demands.
>
> IHMO upstream success is found by motivating the customer to demand
> and make it "easy" for the vendor to supply it.

Yup, exactly the same situation here. The problem seems to be that
gpu vendor stubbornness is higher than even established customer
demand, or they just don't care, and so in the last few years that
customer demand has resulted in paying consulting shops and hiring
engineers to reverse-engineer a full driver, instead of
customer and vendor splitting the difference and the vendor
upstreaming their stack. And that's for companies who've done it in
the past, or at least collaborated on parts like the kernel driver, so
I really have no clue why they don't just continue. We have
well-established customers who do want it all open and upstream,
across kernel and userspace pieces.

And it looks like it's going to repeat itself a few more times
unfortunately. I'm not sure when exactly the lesson will sink in.

Maybe I missed some, but looking at current render/compute drivers I
think (though I'm not even sure of that) only drm/lima is a hobbyist
project, and perhaps you want to include drm/nouveau as not paid for by
customers and more something redhat does out of principle. All the others are
paid for by customers, with vendor involvement ranging from "just
helping out with the kernel driver" to "pays for pretty much all of
the development". And still apparently that's not enough demand for an
upstream driver stack.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-07 12:18:50

by Christian König

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

Am 06.07.21 um 14:23 schrieb Daniel Vetter:
> On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
>>>> Greg, I hope this will be good enough for you to merge this code.
>>> So we're officially going to use dri-devel for technical details review
>>> and then Greg for merging so we don't have to deal with other merge
>>> criteria dri-devel folks have?
>>>
>>> I don't expect anything less by now, but it does make the original claim
>>> that drivers/misc will not step all over accelerators folks a complete
>>> farce under the totally-not-a-gpu banner.
>>>
>>> This essentially means that for any other accelerator stack that doesn't
>>> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
>>> other gpu driver stuff, you can just send it to Greg and it's good to go.
>>>
>>> There's quite a lot of these floating around actually (and many do have
>>> semi-open runtimes, like habanalabs have now too, just not open enough to
>>> be actually useful). It's going to be absolutely lovely having to explain
>>> to these companies in background chats why habanalabs gets away with their
>>> stack and they don't.
>> FYI, I fully agree with Daniel here. Habanlabs needs to open up their
>> runtime if they want to push any additional feature in the kernel.
>> The current situation is not sustainable.
> Before anyone replies: The runtime is open, the compiler is still closed.
> This has become the new default for accel driver submissions, I think
> mostly because all the interesting bits for non-3d accelerators are in the
> accel ISA, and no longer in the runtime. So vendors are fairly happy to
> throw in the runtime as a freebie.

Well, a compiler and runtime make things easier, but the real question
is whether they are really required for upstreaming a kernel driver.

I mean, what we need is to be able to exercise the functionality. So
wouldn't (for example) an assembler be sufficient?

> It's still incomplete, and it's still useless if you want to actually hack
> on the driver stack.

Yeah, when you want to hack on it in the sense of extending it, then this
requirement is certainly true.

But as far as I can see userspace doesn't need to be extensible to justify
a kernel driver. It just needs to have enough glue to thoughtfully
exercise the relevant kernel interfaces.

Applying that to GPUs, I think what you need to be able to do is write
shaders, but that doesn't need to be in a higher-level language requiring a
compiler and runtime. Released opcodes and a low-level assembler should
be sufficient.

Regards,
Christian.

> -Daniel
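
To make the "enough glue to exercise the relevant kernel interfaces" point
concrete, the kind of test Christian describes can be as small as the sketch
below: a handful of hand-assembled opcodes pushed at a submit ioctl, with no
compiler or runtime anywhere. The device node, ioctl number, struct layout
and opcode values are placeholders standing in for whatever uAPI a given
driver actually defines; only the libc calls are real.

/*
 * Hypothetical sketch only: EXAMPLE_IOCTL_SUBMIT, struct example_submit
 * and the opcodes below are placeholders, not any real driver's uAPI.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct example_submit {			/* placeholder uAPI struct */
	uint64_t cmds_user_ptr;		/* user pointer to the opcode stream */
	uint32_t cmds_size;		/* size of the stream in bytes */
	uint32_t flags;
};

#define EXAMPLE_IOCTL_SUBMIT _IOWR('E', 0x01, struct example_submit)

int main(void)
{
	/* Hand-assembled opcodes, e.g. "load immediate, add, halt". */
	uint32_t cmds[] = { 0x01000001, 0x02000102, 0xff000000 };
	struct example_submit args;
	int fd, ret;

	fd = open("/dev/example-accel0", O_RDWR);	/* placeholder node */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.cmds_user_ptr = (uint64_t)(uintptr_t)cmds;
	args.cmds_size = sizeof(cmds);

	ret = ioctl(fd, EXAMPLE_IOCTL_SUBMIT, &args);
	if (ret)
		perror("submit");

	close(fd);
	return ret ? 1 : 0;
}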

2021-07-07 12:58:15

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On Wed, Jul 7, 2021 at 2:17 PM Christian König
<[email protected]> wrote:
> Am 06.07.21 um 14:23 schrieb Daniel Vetter:
> > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> >> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> >>>> Greg, I hope this will be good enough for you to merge this code.
> >>> So we're officially going to use dri-devel for technical details review
> >>> and then Greg for merging so we don't have to deal with other merge
> >>> criteria dri-devel folks have?
> >>>
> >>> I don't expect anything less by now, but it does make the original claim
> >>> that drivers/misc will not step all over accelerators folks a complete
> >>> farce under the totally-not-a-gpu banner.
> >>>
> >>> This essentially means that for any other accelerator stack that doesn't
> >>> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> >>> other gpu driver stuff, you can just send it to Greg and it's good to go.
> >>>
> >>> There's quite a lot of these floating around actually (and many do have
> >>> semi-open runtimes, like habanalabs have now too, just not open enough to
> >>> be actually useful). It's going to be absolutely lovely having to explain
> >>> to these companies in background chats why habanalabs gets away with their
> >>> stack and they don't.
> >> FYI, I fully agree with Daniel here. Habanlabs needs to open up their
> >> runtime if they want to push any additional feature in the kernel.
> >> The current situation is not sustainable.
> > Before anyone replies: The runtime is open, the compiler is still closed.
> > This has become the new default for accel driver submissions, I think
> > mostly because all the interesting bits for non-3d accelerators are in the
> > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > throw in the runtime as a freebie.
>
> Well a compiler and runtime makes things easier, but the real question
> is if they are really required for upstreaming a kernel driver?
>
> I mean what we need is to be able to exercise the functionality. So
> wouldn't (for example) an assembler be sufficient?

So no one has tried this yet, but I think an assembler, or maybe even
just the full PRM for the ISA, would also be good enough.

I guess in practice everyone just comes with the compiler for a few reasons:
- AMD and Intel are great and release full PRMs for the gpu, but
preparing those takes a lot of time. Often that's done as part of
bring-up, to make sure everything is annotated properly, so that all
the necessary bits are included, but none of the future stuff, or
silicon bring-up pieces. So in reality you have the compiler before
you have the ISA docs.

- reverse-engineered drivers also tend to have demo compilers before
anything like full ISA docs show up :-) But also the docs tooling they
have is great.

- then there's the case of developing a driver with NDA'd docs. Again
you'll have a compiler as the only real output; there's not going to
be any docs or anything like that.

> > It's still incomplete, and it's still useless if you want to actually hack
> > on the driver stack.
>
> Yeah, when you want to hack on it in the sense of extending it then this
> requirement is certainly true.
>
> But as far as I can see userspace don't need to be extendable to justify
> a kernel driver. It just needs to have enough glue to thoughtfully
> exercise the relevant kernel interfaces.
>
> Applying that to GPUs I think what you need to be able to is to write
> shaders, but that doesn't need to be in a higher language requiring a
> compiler and runtime. Released opcodes and a low level assembler should
> be sufficient.

Yeah, I think in theory ISA docs + an assembler testcase or whatever is
perfectly fine. In reality anyone who cares enough to do this properly
gets to the demo-quality compiler stage first, and so that's what we
take for merging a new stack.

I do disagree that we're only ever asking for this and not more, e.g.
if you come with a new 3d accelerator and it's not coming with a
userspace driver as a mesa MR, you have to do some very serious
explaining about wtf you're doing - mesa3d won, pretty much across the
board, as a common project for both vulkan and opengl, and the
justifications for reinventing wheels better be really good here. Also
by the time you've written enough scaffolding to show it integrates in
non-stupid ways into mesa, you practically have a demo-quality driver
stack anyway.

Similarly on the display side of things, over the past year the consensus
merge criteria have gone up quite a bit, e.g. there's a patch
floating around to make that clearer:

https://lore.kernel.org/dri-devel/[email protected]/

Of course this doesn't include anything grandfathered in (*cough*
amdvlk *cough*), and also outside of 3d there's clearly no
cross-vendor project that's established enough; media, compute, and AI/NN
stuff is all very badly fragmented. That's maybe lamentable, but like
you said, not really a reason to reject a kernel driver.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-07-09 14:48:59

by Dennis Dalessandro

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs

On 7/6/21 1:59 PM, Jason Gunthorpe wrote:
> I can not say the same about other company's RDMA driver
> distributions, Daniel's description of "minimal effort to get
> goodwill" would match others much better.

Not sure what other RDMA driver you are talking about, but as for Cornelis
Networks, we do have a packaged-up version of our software. However, it is meant
to make things easier for end users by bridging the gap between the distro kernel
drivers and the upstream kernel.

It's definitely not a requirement, and plenty of folks do use distro
kernels/drivers. I'm not sure how many large sites are using something straight
off kernel.org, but the upstream hfi1 driver is 100% the real deal.

We continually develop on and test the upstream kernel. Our goal is always to
upstream patches first. We learned that lesson the hard way when we first tried
to upstream hfi1.

-Denny