2020-03-28 08:53:12

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 1/6] habanalabs: don't wait for ASIC CPU after reset

Upon reset of the ASIC, the driver would have waited for the CPU to come
out of reset before finishing the reset process. This was done for the
purpose of making the CPU available to answer FLR requests. However, when a
VM shuts down, the driver isn't removed so a reset never happens.
Therefore, remove this waiting period as we don't need it.

Signed-off-by: Oded Gabbay <[email protected]>
---
drivers/misc/habanalabs/goya/goya.c | 24 ------------------------
drivers/misc/habanalabs/goya/goyaP.h | 2 +-
2 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 68f065607544..db125cf80850 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -2684,30 +2684,6 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
HW_CAP_MMU | HW_CAP_TPC_MBIST |
HW_CAP_GOLDEN | HW_CAP_TPC);
memset(goya->events_stat, 0, sizeof(goya->events_stat));
-
- if (!hdev->pldm) {
- int rc;
- /* In case we are running inside VM and the VM is
- * shutting down, we need to make sure CPU boot-loader
- * is running before we can continue the VM shutdown.
- * That is because the VM will send an FLR signal that
- * we must answer
- */
- dev_info(hdev->dev,
- "Going to wait up to %ds for CPU boot loader\n",
- GOYA_CPU_TIMEOUT_USEC / 1000 / 1000);
-
- rc = hl_poll_timeout(
- hdev,
- mmPSOC_GLOBAL_CONF_WARM_REBOOT,
- status,
- (status == CPU_BOOT_STATUS_DRAM_RDY),
- 10000,
- GOYA_CPU_TIMEOUT_USEC);
- if (rc)
- dev_err(hdev->dev,
- "failed to wait for CPU boot loader\n");
- }
}

int goya_suspend(struct hl_device *hdev)
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index c3230cb6e25c..1555d03e3cb2 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -45,7 +45,7 @@

#define CORESIGHT_TIMEOUT_USEC 100000 /* 100 ms */

-#define GOYA_CPU_TIMEOUT_USEC 10000000 /* 10s */
+#define GOYA_CPU_TIMEOUT_USEC 15000000 /* 15s */

#define TPC_ENABLED_MASK 0xFF

--
2.17.1


2020-03-28 08:53:22

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 2/6] habanalabs: remove stop-on-error flag from DMA

From: Omer Shpigelman <[email protected]>

Stop-on-error mode in DMA is useful as it stops the transaction
immediately upon error e.g. page fault.
But it may cause the next command submission to fail as is leaves the DMA
in unstable state.
Therefore we remove the stop-on-error configuration from the DMA.
Stop-on-err is still available for debug.

Signed-off-by: Omer Shpigelman <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>
---
.../ABI/testing/debugfs-driver-habanalabs | 7 +++
drivers/misc/habanalabs/debugfs.c | 55 +++++++++++++++++++
drivers/misc/habanalabs/goya/goya.c | 6 +-
drivers/misc/habanalabs/habanalabs.h | 2 +
.../include/goya/asic_reg/goya_masks.h | 3 +-
5 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
index a73601c5121e..67e04f2d7e1d 100644
--- a/Documentation/ABI/testing/debugfs-driver-habanalabs
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -150,3 +150,10 @@ KernelVersion: 5.1
Contact: [email protected]
Description: Displays a list with information about all the active virtual
address mappings per ASID
+
+What: /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
+Date: Mar 2020
+KernelVersion: 5.6
+Contact: [email protected]
+Description: Sets the stop-on_error option for the device engines. Value of
+ "0" is for disable, otherwise enable.
diff --git a/drivers/misc/habanalabs/debugfs.c b/drivers/misc/habanalabs/debugfs.c
index 756d36ed5d95..37beff3096f8 100644
--- a/drivers/misc/habanalabs/debugfs.c
+++ b/drivers/misc/habanalabs/debugfs.c
@@ -970,6 +970,49 @@ static ssize_t hl_device_write(struct file *f, const char __user *buf,
return count;
}

+static ssize_t hl_stop_on_err_read(struct file *f, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+ struct hl_device *hdev = entry->hdev;
+ char tmp_buf[200];
+ ssize_t rc;
+
+ if (*ppos)
+ return 0;
+
+ sprintf(tmp_buf, "%d\n", hdev->stop_on_err);
+ rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
+ strlen(tmp_buf) + 1);
+
+ return rc;
+}
+
+static ssize_t hl_stop_on_err_write(struct file *f, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+ struct hl_device *hdev = entry->hdev;
+ u32 value;
+ ssize_t rc;
+
+ if (atomic_read(&hdev->in_reset)) {
+ dev_warn_ratelimited(hdev->dev,
+ "Can't change stop on error during reset\n");
+ return 0;
+ }
+
+ rc = kstrtouint_from_user(buf, count, 10, &value);
+ if (rc)
+ return rc;
+
+ hdev->stop_on_err = value ? 1 : 0;
+
+ hl_device_reset(hdev, false, false);
+
+ return count;
+}
+
static const struct file_operations hl_data32b_fops = {
.owner = THIS_MODULE,
.read = hl_data_read32,
@@ -1015,6 +1058,12 @@ static const struct file_operations hl_device_fops = {
.write = hl_device_write
};

+static const struct file_operations hl_stop_on_err_fops = {
+ .owner = THIS_MODULE,
+ .read = hl_stop_on_err_read,
+ .write = hl_stop_on_err_write
+};
+
static const struct hl_info_list hl_debugfs_list[] = {
{"command_buffers", command_buffers_show, NULL},
{"command_submission", command_submission_show, NULL},
@@ -1152,6 +1201,12 @@ void hl_debugfs_add_device(struct hl_device *hdev)
dev_entry,
&hl_device_fops);

+ debugfs_create_file("stop_on_err",
+ 0644,
+ dev_entry->root,
+ dev_entry,
+ &hl_stop_on_err_fops);
+
for (i = 0, entry = dev_entry->entry_arr ; i < count ; i++, entry++) {

ent = debugfs_create_file(hl_debugfs_list[i].name,
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index db125cf80850..08f1d4080008 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -800,6 +800,7 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
u32 so_base_lo, so_base_hi;
u32 gic_base_lo, gic_base_hi;
u32 reg_off = dma_id * (mmDMA_QM_1_PQ_PI - mmDMA_QM_0_PQ_PI);
+ u32 dma_err_cfg = QMAN_DMA_ERR_MSG_EN;

mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
@@ -836,7 +837,10 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
else
WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);

- WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
+ if (hdev->stop_on_err)
+ dma_err_cfg |= 1 << DMA_QM_0_GLBL_ERR_CFG_DMA_STOP_ON_ERR_SHIFT;
+
+ WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, dma_err_cfg);
WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
}

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 31ebcf9458fe..ae3db8eb2fb5 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -1300,6 +1300,7 @@ struct hl_device_idle_busy_ts {
* @in_debug: is device under debug. This, together with fpriv_list, enforces
* that only a single user is configuring the debug infrastructure.
* @cdev_sysfs_created: were char devices and sysfs nodes created.
+ * @stop_on_err: true if engines should stop on error.
*/
struct hl_device {
struct pci_dev *pdev;
@@ -1380,6 +1381,7 @@ struct hl_device {
u8 dma_mask;
u8 in_debug;
u8 cdev_sysfs_created;
+ u8 stop_on_err;

/* Parameters for bring-up */
u8 mmu_enable;
diff --git a/drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h b/drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
index 3c44ef3a23ed..067489bd048e 100644
--- a/drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
+++ b/drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
@@ -55,8 +55,7 @@
(1 << DMA_QM_0_GLBL_ERR_CFG_DMA_ERR_MSG_EN_SHIFT) | \
(1 << DMA_QM_0_GLBL_ERR_CFG_PQF_STOP_ON_ERR_SHIFT) | \
(1 << DMA_QM_0_GLBL_ERR_CFG_CQF_STOP_ON_ERR_SHIFT) | \
- (1 << DMA_QM_0_GLBL_ERR_CFG_CP_STOP_ON_ERR_SHIFT) | \
- (1 << DMA_QM_0_GLBL_ERR_CFG_DMA_STOP_ON_ERR_SHIFT))
+ (1 << DMA_QM_0_GLBL_ERR_CFG_CP_STOP_ON_ERR_SHIFT))

#define QMAN_MME_ENABLE (\
(1 << MME_QM_GLBL_CFG0_PQF_EN_SHIFT) | \
--
2.17.1

2020-03-28 08:53:26

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 3/6] habanalabs: re-factor H/W queues initialization

From: Omer Shpigelman <[email protected]>

We want to remove the following restrictions/assumptions in our driver:
1. The H/W queue index is also the completion queue index.
2. The H/W queue index is also the IRQ number of the completion queue.
3. All queues of the same type have consecutive indexes.

Therefore we add the support for H/W queues of the same type with
nonconsecutive indexes and completion queue index and IRQ number different
than the H/W queue index.

Signed-off-by: Omer Shpigelman <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>
---
drivers/misc/habanalabs/device.c | 17 +++++++++--------
drivers/misc/habanalabs/goya/goya.c | 9 ++++++++-
drivers/misc/habanalabs/goya/goyaP.h | 1 +
drivers/misc/habanalabs/habanalabs.h | 6 ++++++
drivers/misc/habanalabs/hw_queue.c | 10 +++++-----
5 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index aef4de36b7aa..c89157dafa33 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -1062,7 +1062,7 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
*/
int hl_device_init(struct hl_device *hdev, struct class *hclass)
{
- int i, rc, cq_ready_cnt;
+ int i, rc, cq_cnt, cq_ready_cnt;
char *name;
bool add_cdev_sysfs_on_err = false;

@@ -1120,14 +1120,16 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
goto sw_fini;
}

+ cq_cnt = hdev->asic_prop.completion_queues_count;
+
/*
* Initialize the completion queues. Must be done before hw_init,
* because there the addresses of the completion queues are being
* passed as arguments to request_irq
*/
- hdev->completion_queue =
- kcalloc(hdev->asic_prop.completion_queues_count,
- sizeof(*hdev->completion_queue), GFP_KERNEL);
+ hdev->completion_queue = kcalloc(cq_cnt,
+ sizeof(*hdev->completion_queue),
+ GFP_KERNEL);

if (!hdev->completion_queue) {
dev_err(hdev->dev, "failed to allocate completion queues\n");
@@ -1135,10 +1137,9 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
goto hw_queues_destroy;
}

- for (i = 0, cq_ready_cnt = 0;
- i < hdev->asic_prop.completion_queues_count;
- i++, cq_ready_cnt++) {
- rc = hl_cq_init(hdev, &hdev->completion_queue[i], i);
+ for (i = 0, cq_ready_cnt = 0 ; i < cq_cnt ; i++, cq_ready_cnt++) {
+ rc = hl_cq_init(hdev, &hdev->completion_queue[i],
+ hdev->asic_funcs->get_queue_id_for_cq(hdev, i));
if (rc) {
dev_err(hdev->dev,
"failed to initialize completion queue\n");
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 08f1d4080008..f7eb60f5f6f9 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -890,6 +890,7 @@ void goya_init_dma_qmans(struct hl_device *hdev)
q = &hdev->kernel_queues[0];

for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++, q++) {
+ q->cq_id = q->msi_vec = i;
goya_init_dma_qman(hdev, i, q->bus_address);
goya_init_dma_ch(hdev, i);
}
@@ -5273,6 +5274,11 @@ static enum hl_device_hw_state goya_get_hw_state(struct hl_device *hdev)
return RREG32(mmHW_STATE);
}

+u32 goya_get_queue_id_for_cq(struct hl_device *hdev, u32 cq_idx)
+{
+ return cq_idx;
+}
+
static const struct hl_asic_funcs goya_funcs = {
.early_init = goya_early_init,
.early_fini = goya_early_fini,
@@ -5332,7 +5338,8 @@ static const struct hl_asic_funcs goya_funcs = {
.rreg = hl_rreg,
.wreg = hl_wreg,
.halt_coresight = goya_halt_coresight,
- .get_clk_rate = goya_get_clk_rate
+ .get_clk_rate = goya_get_clk_rate,
+ .get_queue_id_for_cq = goya_get_queue_id_for_cq
};

/*
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 1555d03e3cb2..5db5f6ea1d98 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -234,5 +234,6 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
void goya_mmu_remove_device_cpu_mappings(struct hl_device *hdev);

int goya_get_clk_rate(struct hl_device *hdev, u32 *cur_clk, u32 *max_clk);
+u32 goya_get_queue_id_for_cq(struct hl_device *hdev, u32 cq_idx);

#endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index ae3db8eb2fb5..299add419e79 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -365,6 +365,8 @@ struct hl_cs_job;
* @pi: holds the queue's pi value.
* @ci: holds the queue's ci value, AS CALCULATED BY THE DRIVER (not real ci).
* @hw_queue_id: the id of the H/W queue.
+ * @cq_id: the id for the corresponding CQ for this H/W queue.
+ * @msi_vec: the IRQ number of the H/W queue.
* @int_queue_len: length of internal queue (number of entries).
* @valid: is the queue valid (we have array of 32 queues, not all of them
* exists).
@@ -377,6 +379,8 @@ struct hl_hw_queue {
u32 pi;
u32 ci;
u32 hw_queue_id;
+ u32 cq_id;
+ u32 msi_vec;
u16 int_queue_len;
u8 valid;
};
@@ -534,6 +538,7 @@ enum hl_pll_frequency {
* @wreg: Write a register. Needed for simulator support.
* @halt_coresight: stop the ETF and ETR traces.
* @get_clk_rate: Retrieve the ASIC current and maximum clock rate in MHz
+ * @get_queue_id_for_cq: Get the H/W queue id related to the given CQ index.
*/
struct hl_asic_funcs {
int (*early_init)(struct hl_device *hdev);
@@ -620,6 +625,7 @@ struct hl_asic_funcs {
void (*wreg)(struct hl_device *hdev, u32 reg, u32 val);
void (*halt_coresight)(struct hl_device *hdev);
int (*get_clk_rate)(struct hl_device *hdev, u32 *cur_clk, u32 *max_clk);
+ u32 (*get_queue_id_for_cq)(struct hl_device *hdev, u32 cq_idx);
};


diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
index 91579dde9262..8248adcc7ef8 100644
--- a/drivers/misc/habanalabs/hw_queue.c
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -111,7 +111,7 @@ static int ext_queue_sanity_checks(struct hl_device *hdev,
bool reserve_cq_entry)
{
atomic_t *free_slots =
- &hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
+ &hdev->completion_queue[q->cq_id].free_slots_cnt;
int free_slots_cnt;

/* Check we have enough space in the queue */
@@ -194,7 +194,7 @@ static int hw_queue_sanity_checks(struct hl_device *hdev, struct hl_hw_queue *q,
int num_of_entries)
{
atomic_t *free_slots =
- &hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
+ &hdev->completion_queue[q->cq_id].free_slots_cnt;

/*
* Check we have enough space in the completion queue.
@@ -308,13 +308,13 @@ static void ext_queue_schedule_job(struct hl_cs_job *job)
* No need to check if CQ is full because it was already
* checked in ext_queue_sanity_checks
*/
- cq = &hdev->completion_queue[q->hw_queue_id];
+ cq = &hdev->completion_queue[q->cq_id];
cq_addr = cq->bus_address + cq->pi * sizeof(struct hl_cq_entry);

hdev->asic_funcs->add_end_of_cb_packets(hdev, cb->kernel_address, len,
cq_addr,
le32_to_cpu(cq_pkt.data),
- q->hw_queue_id);
+ q->msi_vec);

q->shadow_queue[hl_pi_2_offset(q->pi)] = job;

@@ -401,7 +401,7 @@ static void hw_queue_schedule_job(struct hl_cs_job *job)
* No need to check if CQ is full because it was already
* checked in hw_queue_sanity_checks
*/
- cq = &hdev->completion_queue[q->hw_queue_id];
+ cq = &hdev->completion_queue[q->cq_id];
cq->pi = hl_cq_inc_ptr(cq->pi);

ext_and_hw_queue_submit_bd(hdev, q, ctl, len, ptr);
--
2.17.1

2020-03-28 08:53:57

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 5/6] habanalabs: print warning when reset is requested

When the system administrator asks the driver to soft or hard reset the
device through sysfs, the driver should display a warning in the kernel log
to explain why it suddenly resets the device.

Signed-off-by: Oded Gabbay <[email protected]>
---
drivers/misc/habanalabs/sysfs.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index 4cd622b017b9..e478a191e5f5 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -183,6 +183,8 @@ static ssize_t soft_reset_store(struct device *dev,
goto out;
}

+ dev_warn(hdev->dev, "Soft-Reset requested through sysfs\n");
+
hl_device_reset(hdev, false, false);

out:
@@ -204,6 +206,8 @@ static ssize_t hard_reset_store(struct device *dev,
goto out;
}

+ dev_warn(hdev->dev, "Hard-Reset requested through sysfs\n");
+
hl_device_reset(hdev, true, false);

out:
--
2.17.1

2020-03-28 08:54:23

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 4/6] habanalabs: unify and improve device cpu init

Move the code of device CPU initialization from being ASIC-Dependent to
common code. In addition, add support for the new error reporting feature
of the firmware boot code.

Signed-off-by: Oded Gabbay <[email protected]>
---
drivers/misc/habanalabs/firmware_if.c | 188 ++++++++++++++++++-
drivers/misc/habanalabs/goya/goya.c | 116 ++----------
drivers/misc/habanalabs/goya/goyaP.h | 5 -
drivers/misc/habanalabs/habanalabs.h | 21 ++-
drivers/misc/habanalabs/include/hl_boot_if.h | 3 +-
5 files changed, 220 insertions(+), 113 deletions(-)

diff --git a/drivers/misc/habanalabs/firmware_if.c b/drivers/misc/habanalabs/firmware_if.c
index f5bd03171dac..99983822feeb 100644
--- a/drivers/misc/habanalabs/firmware_if.c
+++ b/drivers/misc/habanalabs/firmware_if.c
@@ -6,20 +6,21 @@
*/

#include "habanalabs.h"
+#include "include/hl_boot_if.h"

#include <linux/firmware.h>
#include <linux/genalloc.h>
#include <linux/io-64-nonatomic-lo-hi.h>

/**
- * hl_fw_push_fw_to_device() - Push FW code to device.
+ * hl_fw_load_fw_to_device() - Load F/W code to device's memory.
* @hdev: pointer to hl_device structure.
*
* Copy fw code from firmware file to device memory.
*
* Return: 0 on success, non-zero for failure.
*/
-int hl_fw_push_fw_to_device(struct hl_device *hdev, const char *fw_name,
+int hl_fw_load_fw_to_device(struct hl_device *hdev, const char *fw_name,
void __iomem *dst)
{
const struct firmware *fw;
@@ -286,3 +287,186 @@ int hl_fw_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)

return rc;
}
+
+static void fw_read_errors(struct hl_device *hdev, u32 boot_err0_reg)
+{
+ u32 err_val;
+
+ /* Some of the firmware status codes are deprecated in newer f/w
+ * versions. In those versions, the errors are reported
+ * in different registers. Therefore, we need to check those
+ * registers and print the exact errors. Moreover, there
+ * may be multiple errors, so we need to report on each error
+ * separately. Some of the error codes might indicate a state
+ * that is not an error per-se, but it is an error in production
+ * environment
+ */
+ err_val = RREG32(boot_err0_reg);
+ if (!(err_val & CPU_BOOT_ERR0_ENABLED))
+ return;
+
+ if (err_val & CPU_BOOT_ERR0_DRAM_INIT_FAIL)
+ dev_err(hdev->dev,
+ "Device boot error - DRAM initialization failed\n");
+ if (err_val & CPU_BOOT_ERR0_FIT_CORRUPTED)
+ dev_err(hdev->dev, "Device boot error - FIT image corrupted\n");
+ if (err_val & CPU_BOOT_ERR0_TS_INIT_FAIL)
+ dev_err(hdev->dev,
+ "Device boot error - Thermal Sensor initialization failed\n");
+ if (err_val & CPU_BOOT_ERR0_DRAM_SKIPPED)
+ dev_warn(hdev->dev,
+ "Device boot warning - Skipped DRAM initialization\n");
+ if (err_val & CPU_BOOT_ERR0_BMC_WAIT_SKIPPED)
+ dev_warn(hdev->dev,
+ "Device boot error - Skipped waiting for BMC\n");
+ if (err_val & CPU_BOOT_ERR0_NIC_DATA_NOT_RDY)
+ dev_err(hdev->dev,
+ "Device boot error - Serdes data from BMC not available\n");
+ if (err_val & CPU_BOOT_ERR0_NIC_FW_FAIL)
+ dev_err(hdev->dev,
+ "Device boot error - NIC F/W initialization failed\n");
+}
+
+int hl_fw_init_cpu(struct hl_device *hdev, u32 cpu_boot_status_reg,
+ u32 msg_to_cpu_reg, u32 boot_err0_reg, bool skip_bmc,
+ u32 cpu_timeout)
+{
+ u32 status;
+ int rc;
+
+ dev_info(hdev->dev, "Going to wait for device boot (up to %lds)\n",
+ cpu_timeout / USEC_PER_SEC);
+
+ /* Make sure CPU boot-loader is running */
+ rc = hl_poll_timeout(
+ hdev,
+ cpu_boot_status_reg,
+ status,
+ (status == CPU_BOOT_STATUS_DRAM_RDY) ||
+ (status == CPU_BOOT_STATUS_NIC_FW_RDY) ||
+ (status == CPU_BOOT_STATUS_READY_TO_BOOT) ||
+ (status == CPU_BOOT_STATUS_SRAM_AVAIL),
+ 10000,
+ cpu_timeout);
+
+ /* Read U-Boot, preboot versions now in case we will later fail */
+ hdev->asic_funcs->read_device_fw_version(hdev, FW_COMP_UBOOT);
+ hdev->asic_funcs->read_device_fw_version(hdev, FW_COMP_PREBOOT);
+
+ /* Some of the status codes below are deprecated in newer f/w
+ * versions but we keep them here for backward compatibility
+ */
+ if (rc) {
+ switch (status) {
+ case CPU_BOOT_STATUS_NA:
+ dev_err(hdev->dev,
+ "Device boot error - BTL did NOT run\n");
+ break;
+ case CPU_BOOT_STATUS_IN_WFE:
+ dev_err(hdev->dev,
+ "Device boot error - Stuck inside WFE loop\n");
+ break;
+ case CPU_BOOT_STATUS_IN_BTL:
+ dev_err(hdev->dev,
+ "Device boot error - Stuck in BTL\n");
+ break;
+ case CPU_BOOT_STATUS_IN_PREBOOT:
+ dev_err(hdev->dev,
+ "Device boot error - Stuck in Preboot\n");
+ break;
+ case CPU_BOOT_STATUS_IN_SPL:
+ dev_err(hdev->dev,
+ "Device boot error - Stuck in SPL\n");
+ break;
+ case CPU_BOOT_STATUS_IN_UBOOT:
+ dev_err(hdev->dev,
+ "Device boot error - Stuck in u-boot\n");
+ break;
+ case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
+ dev_err(hdev->dev,
+ "Device boot error - DRAM initialization failed\n");
+ break;
+ case CPU_BOOT_STATUS_UBOOT_NOT_READY:
+ dev_err(hdev->dev,
+ "Device boot error - u-boot stopped by user\n");
+ break;
+ case CPU_BOOT_STATUS_TS_INIT_FAIL:
+ dev_err(hdev->dev,
+ "Device boot error - Thermal Sensor initialization failed\n");
+ break;
+ default:
+ dev_err(hdev->dev,
+ "Device boot error - Invalid status code\n");
+ break;
+ }
+
+ rc = -EIO;
+ goto out;
+ }
+
+ if (!hdev->fw_loading) {
+ dev_info(hdev->dev, "Skip loading FW\n");
+ goto out;
+ }
+
+ if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
+ goto out;
+
+ dev_info(hdev->dev,
+ "Loading firmware to device, may take some time...\n");
+
+ rc = hdev->asic_funcs->load_firmware_to_device(hdev);
+ if (rc)
+ goto out;
+
+ if (skip_bmc) {
+ WREG32(msg_to_cpu_reg, KMD_MSG_SKIP_BMC);
+
+ rc = hl_poll_timeout(
+ hdev,
+ cpu_boot_status_reg,
+ status,
+ (status == CPU_BOOT_STATUS_BMC_WAITING_SKIPPED),
+ 10000,
+ cpu_timeout);
+
+ if (rc) {
+ dev_err(hdev->dev,
+ "Failed to get ACK on skipping BMC, %d\n",
+ status);
+ WREG32(msg_to_cpu_reg, KMD_MSG_NA);
+ rc = -EIO;
+ goto out;
+ }
+ }
+
+ WREG32(msg_to_cpu_reg, KMD_MSG_FIT_RDY);
+
+ rc = hl_poll_timeout(
+ hdev,
+ cpu_boot_status_reg,
+ status,
+ (status == CPU_BOOT_STATUS_SRAM_AVAIL),
+ 10000,
+ cpu_timeout);
+
+ if (rc) {
+ if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
+ dev_err(hdev->dev,
+ "Device reports FIT image is corrupted\n");
+ else
+ dev_err(hdev->dev,
+ "Device failed to load, %d\n", status);
+
+ WREG32(msg_to_cpu_reg, KMD_MSG_NA);
+ rc = -EIO;
+ goto out;
+ }
+
+ dev_info(hdev->dev, "Successfully loaded firmware to device\n");
+
+out:
+ fw_read_errors(hdev, boot_err0_reg);
+
+ return rc;
+}
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index f7eb60f5f6f9..a0a96ca31757 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -2223,24 +2223,24 @@ static int goya_push_uboot_to_device(struct hl_device *hdev)

dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;

- return hl_fw_push_fw_to_device(hdev, GOYA_UBOOT_FW_FILE, dst);
+ return hl_fw_load_fw_to_device(hdev, GOYA_UBOOT_FW_FILE, dst);
}

/*
- * goya_push_linux_to_device() - Push LINUX FW code to device.
+ * goya_load_firmware_to_device() - Load LINUX FW code to device.
* @hdev: Pointer to hl_device structure.
*
* Copy LINUX fw code from firmware file to HBM BAR.
*
* Return: 0 on success, non-zero for failure.
*/
-static int goya_push_linux_to_device(struct hl_device *hdev)
+static int goya_load_firmware_to_device(struct hl_device *hdev)
{
void __iomem *dst;

dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;

- return hl_fw_push_fw_to_device(hdev, GOYA_LINUX_FW_FILE, dst);
+ return hl_fw_load_fw_to_device(hdev, GOYA_LINUX_FW_FILE, dst);
}

static int goya_pldm_init_cpu(struct hl_device *hdev)
@@ -2266,7 +2266,7 @@ static int goya_pldm_init_cpu(struct hl_device *hdev)
if (rc)
return rc;

- rc = goya_push_linux_to_device(hdev);
+ rc = goya_load_firmware_to_device(hdev);
if (rc)
return rc;

@@ -2291,7 +2291,7 @@ static int goya_pldm_init_cpu(struct hl_device *hdev)
* The version string should be located by that offset.
*/
static void goya_read_device_fw_version(struct hl_device *hdev,
- enum goya_fw_component fwc)
+ enum hl_fw_component fwc)
{
const char *name;
u32 ver_off;
@@ -2328,7 +2328,6 @@ static void goya_read_device_fw_version(struct hl_device *hdev,
static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
{
struct goya_device *goya = hdev->asic_specific;
- u32 status;
int rc;

if (!hdev->cpu_enable)
@@ -2355,106 +2354,13 @@ static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
goto out;
}

- /* Make sure CPU boot-loader is running */
- rc = hl_poll_timeout(
- hdev,
- mmPSOC_GLOBAL_CONF_WARM_REBOOT,
- status,
- (status == CPU_BOOT_STATUS_DRAM_RDY) ||
- (status == CPU_BOOT_STATUS_SRAM_AVAIL),
- 10000,
- cpu_timeout);
-
- /* Read U-Boot version now in case we will later fail */
- goya_read_device_fw_version(hdev, FW_COMP_UBOOT);
- goya_read_device_fw_version(hdev, FW_COMP_PREBOOT);
-
- if (rc) {
- dev_err(hdev->dev, "Error in ARM u-boot!");
- switch (status) {
- case CPU_BOOT_STATUS_NA:
- dev_err(hdev->dev,
- "ARM status %d - BTL did NOT run\n", status);
- break;
- case CPU_BOOT_STATUS_IN_WFE:
- dev_err(hdev->dev,
- "ARM status %d - Inside WFE loop\n", status);
- break;
- case CPU_BOOT_STATUS_IN_BTL:
- dev_err(hdev->dev,
- "ARM status %d - Stuck in BTL\n", status);
- break;
- case CPU_BOOT_STATUS_IN_PREBOOT:
- dev_err(hdev->dev,
- "ARM status %d - Stuck in Preboot\n", status);
- break;
- case CPU_BOOT_STATUS_IN_SPL:
- dev_err(hdev->dev,
- "ARM status %d - Stuck in SPL\n", status);
- break;
- case CPU_BOOT_STATUS_IN_UBOOT:
- dev_err(hdev->dev,
- "ARM status %d - Stuck in u-boot\n", status);
- break;
- case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
- dev_err(hdev->dev,
- "ARM status %d - DDR initialization failed\n",
- status);
- break;
- case CPU_BOOT_STATUS_UBOOT_NOT_READY:
- dev_err(hdev->dev,
- "ARM status %d - u-boot stopped by user\n",
- status);
- break;
- case CPU_BOOT_STATUS_TS_INIT_FAIL:
- dev_err(hdev->dev,
- "ARM status %d - Thermal Sensor initialization failed\n",
- status);
- break;
- default:
- dev_err(hdev->dev,
- "ARM status %d - Invalid status code\n",
- status);
- break;
- }
- return -EIO;
- }
-
- if (!hdev->fw_loading) {
- dev_info(hdev->dev, "Skip loading FW\n");
- goto out;
- }
-
- if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
- goto out;
+ rc = hl_fw_init_cpu(hdev, mmPSOC_GLOBAL_CONF_CPU_BOOT_STATUS,
+ mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, mmCPU_BOOT_ERR0,
+ false, cpu_timeout);

- rc = goya_push_linux_to_device(hdev);
if (rc)
return rc;

- WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
-
- rc = hl_poll_timeout(
- hdev,
- mmPSOC_GLOBAL_CONF_WARM_REBOOT,
- status,
- (status == CPU_BOOT_STATUS_SRAM_AVAIL),
- 10000,
- cpu_timeout);
-
- if (rc) {
- if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
- dev_err(hdev->dev,
- "ARM u-boot reports FIT image is corrupted\n");
- else
- dev_err(hdev->dev,
- "ARM Linux failed to load, %d\n", status);
- WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_NA);
- return -EIO;
- }
-
- dev_info(hdev->dev, "Successfully loaded firmware to device\n");
-
out:
goya->hw_cap_initialized |= HW_CAP_CPU;

@@ -5339,7 +5245,9 @@ static const struct hl_asic_funcs goya_funcs = {
.wreg = hl_wreg,
.halt_coresight = goya_halt_coresight,
.get_clk_rate = goya_get_clk_rate,
- .get_queue_id_for_cq = goya_get_queue_id_for_cq
+ .get_queue_id_for_cq = goya_get_queue_id_for_cq,
+ .read_device_fw_version = goya_read_device_fw_version,
+ .load_firmware_to_device = goya_load_firmware_to_device
};

/*
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 5db5f6ea1d98..a05250e53175 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -149,11 +149,6 @@
#define HW_CAP_GOLDEN 0x00000400
#define HW_CAP_TPC 0x00000800

-enum goya_fw_component {
- FW_COMP_UBOOT,
- FW_COMP_PREBOOT
-};
-
struct goya_device {
/* TODO: remove hw_queues_lock after moving to scheduler code */
spinlock_t hw_queues_lock;
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 299add419e79..199f7835ae46 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -75,6 +75,16 @@ struct pgt_info {
struct hl_device;
struct hl_fpriv;

+/**
+ * enum hl_fw_component - F/W components to read version through registers.
+ * @FW_COMP_UBOOT: u-boot.
+ * @FW_COMP_PREBOOT: preboot.
+ */
+enum hl_fw_component {
+ FW_COMP_UBOOT,
+ FW_COMP_PREBOOT
+};
+
/**
* enum hl_queue_type - Supported QUEUE types.
* @QUEUE_TYPE_NA: queue is not available.
@@ -539,6 +549,9 @@ enum hl_pll_frequency {
* @halt_coresight: stop the ETF and ETR traces.
* @get_clk_rate: Retrieve the ASIC current and maximum clock rate in MHz
* @get_queue_id_for_cq: Get the H/W queue id related to the given CQ index.
+ * @read_device_fw_version: read the device's firmware versions that are
+ * contained in registers
+ * @load_firmware_to_device: load the firmware to the device's memory
*/
struct hl_asic_funcs {
int (*early_init)(struct hl_device *hdev);
@@ -626,6 +639,9 @@ struct hl_asic_funcs {
void (*halt_coresight)(struct hl_device *hdev);
int (*get_clk_rate)(struct hl_device *hdev, u32 *cur_clk, u32 *max_clk);
u32 (*get_queue_id_for_cq)(struct hl_device *hdev, u32 cq_idx);
+ void (*read_device_fw_version)(struct hl_device *hdev,
+ enum hl_fw_component fwc);
+ int (*load_firmware_to_device)(struct hl_device *hdev);
};


@@ -1591,7 +1607,7 @@ int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr, u32 page_size,
void hl_mmu_swap_out(struct hl_ctx *ctx);
void hl_mmu_swap_in(struct hl_ctx *ctx);

-int hl_fw_push_fw_to_device(struct hl_device *hdev, const char *fw_name,
+int hl_fw_load_fw_to_device(struct hl_device *hdev, const char *fw_name,
void __iomem *dst);
int hl_fw_send_pci_access_msg(struct hl_device *hdev, u32 opcode);
int hl_fw_send_cpu_message(struct hl_device *hdev, u32 hw_queue_id, u32 *msg,
@@ -1604,6 +1620,9 @@ void hl_fw_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
int hl_fw_send_heartbeat(struct hl_device *hdev);
int hl_fw_armcp_info_get(struct hl_device *hdev);
int hl_fw_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size);
+int hl_fw_init_cpu(struct hl_device *hdev, u32 cpu_boot_status_reg,
+ u32 msg_to_cpu_reg, u32 boot_err0_reg, bool skip_bmc,
+ u32 cpu_timeout);

int hl_pci_bars_map(struct hl_device *hdev, const char * const name[3],
bool is_wc[3]);
diff --git a/drivers/misc/habanalabs/include/hl_boot_if.h b/drivers/misc/habanalabs/include/hl_boot_if.h
index f7992a69fd3a..107482fb45a4 100644
--- a/drivers/misc/habanalabs/include/hl_boot_if.h
+++ b/drivers/misc/habanalabs/include/hl_boot_if.h
@@ -42,7 +42,8 @@ enum cpu_boot_status {
enum kmd_msg {
KMD_MSG_NA = 0,
KMD_MSG_GOTO_WFE,
- KMD_MSG_FIT_RDY
+ KMD_MSG_FIT_RDY,
+ KMD_MSG_SKIP_BMC,
};

#endif /* HL_BOOT_IF_H */
--
2.17.1

2020-03-28 08:54:57

by Oded Gabbay

[permalink] [raw]
Subject: [PATCH 6/6] habanalabs: increase timeout during reset

When doing training, the DL framework (e.g. tensorflow) performs hundreds
of thousands of memory allocations and mappings. In case the driver needs
to perform hard-reset during training, the driver kills the application and
unmaps all those memory allocations. Unfortunately, because of that large
amount of mappings, the driver isn't able to do that in the current timeout
(5 seconds). Therefore, increase the timeout significantly to 30 seconds
to avoid situation where the driver resets the device with active mappings,
which sometime can cause a kernel bug.

BTW, it doesn't mean we will spend all the 30 seconds because the reset
thread checks every one second if the unmap operation is done.

Signed-off-by: Oded Gabbay <[email protected]>
---
drivers/misc/habanalabs/habanalabs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 199f7835ae46..6c54d0ba0a1d 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,7 +23,7 @@

#define HL_MMAP_CB_MASK (0x8000000000000000ull >> PAGE_SHIFT)

-#define HL_PENDING_RESET_PER_SEC 5
+#define HL_PENDING_RESET_PER_SEC 30

#define HL_DEVICE_TIMEOUT_USEC 1000000 /* 1 s */

--
2.17.1

2020-03-30 06:02:15

by Omer Shpigelman

[permalink] [raw]
Subject: RE: [PATCH 1/6] habanalabs: don't wait for ASIC CPU after reset

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <[email protected]> wrote:
> Upon reset of the ASIC, the driver would have waited for the CPU to come out
> of reset before finishing the reset process. This was done for the purpose of
> making the CPU available to answer FLR requests. However, when a VM shuts
> down, the driver isn't removed so a reset never happens.
> Therefore, remove this waiting period as we don't need it.
>
> Signed-off-by: Oded Gabbay <[email protected]>

Reviewed-by: Omer Shpigelman <[email protected]>

2020-03-30 06:09:05

by Omer Shpigelman

[permalink] [raw]
Subject: RE: [PATCH 4/6] habanalabs: unify and improve device cpu init

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <[email protected]> wrote:
> Move the code of device CPU initialization from being ASIC-Dependent to common
> code. In addition, add support for the new error reporting feature of the firmware
> boot code.
>
> Signed-off-by: Oded Gabbay <[email protected]>

Reviewed-by: Omer Shpigelman <[email protected]>

2020-03-30 06:11:10

by Omer Shpigelman

[permalink] [raw]
Subject: RE: [PATCH 5/6] habanalabs: print warning when reset is requested

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <[email protected]> wrote:
> When the system administrator asks the driver to soft or hard reset the device
> through sysfs, the driver should display a warning in the kernel log to explain
> why it suddenly resets the device.
>
> Signed-off-by: Oded Gabbay <[email protected]>

Reviewed-by: Omer Shpigelman <[email protected]>

2020-03-30 06:15:15

by Omer Shpigelman

[permalink] [raw]
Subject: RE: [PATCH 6/6] habanalabs: increase timeout during reset

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <[email protected]> wrote:
> When doing training, the DL framework (e.g. tensorflow) performs hundreds of
> thousands of memory allocations and mappings. In case the driver needs to
> perform hard-reset during training, the driver kills the application and unmaps all
> those memory allocations. Unfortunately, because of that large amount of
> mappings, the driver isn't able to do that in the current timeout (5 seconds).
> Therefore, increase the timeout significantly to 30 seconds to avoid situation
> where the driver resets the device with active mappings, which sometime can
> cause a kernel bug.
>
> BTW, it doesn't mean we will spend all the 30 seconds because the reset thread
> checks every one second if the unmap operation is done.
>
> Signed-off-by: Oded Gabbay <[email protected]>

Reviewed-by: Omer Shpigelman <[email protected]>