2024-01-03 04:09:02

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 0/9] crypto: qat - improve recovery flows

This set improves the error recovery flows in the QAT drivers and
adds a mechanism to test it through an heartbeat simulator.

When a QAT device reports either a fatal error, or an AER fatal error,
or fails an heartbeat check, the PF driver sends an error notification to
the VFs through PFVF comms and if `auto_reset` is enabled then
the device goes through reset flows for error recovery.
If SRIOV is enabled when an error is encountered, this is re-enabled after
the reset cycle is done.

Damian Muszynski (2):
crypto: qat - add heartbeat error simulator
crypto: qat - add auto reset on error

Furong Zhou (3):
crypto: qat - add fatal error notify method
crypto: qat - disable arbitration before reset
crypto: qat - limit heartbeat notifications

Mun Chun Yep (4):
crypto: qat - update PFVF protocol for recovery
crypto: qat - re-enable sriov after pf reset
crypto: qat - add fatal error notification
crypto: qat - improve aer error reset handling

Documentation/ABI/testing/debugfs-driver-qat | 26 +++++
Documentation/ABI/testing/sysfs-driver-qat | 20 ++++
drivers/crypto/intel/qat/Kconfig | 15 +++
drivers/crypto/intel/qat/qat_common/Makefile | 3 +
.../intel/qat/qat_common/adf_accel_devices.h | 2 +
drivers/crypto/intel/qat/qat_common/adf_aer.c | 107 +++++++++++++++++-
.../intel/qat/qat_common/adf_cfg_strings.h | 1 +
.../intel/qat/qat_common/adf_common_drv.h | 10 ++
.../intel/qat/qat_common/adf_heartbeat.c | 20 +++-
.../intel/qat/qat_common/adf_heartbeat.h | 21 ++++
.../qat/qat_common/adf_heartbeat_dbgfs.c | 52 +++++++++
.../qat/qat_common/adf_heartbeat_inject.c | 76 +++++++++++++
.../intel/qat/qat_common/adf_hw_arbiter.c | 25 ++++
.../crypto/intel/qat/qat_common/adf_init.c | 12 ++
drivers/crypto/intel/qat/qat_common/adf_isr.c | 7 +-
.../intel/qat/qat_common/adf_pfvf_msg.h | 7 +-
.../intel/qat/qat_common/adf_pfvf_pf_msg.c | 64 ++++++++++-
.../intel/qat/qat_common/adf_pfvf_pf_msg.h | 21 ++++
.../intel/qat/qat_common/adf_pfvf_pf_proto.c | 8 ++
.../intel/qat/qat_common/adf_pfvf_vf_proto.c | 6 +
.../crypto/intel/qat/qat_common/adf_sriov.c | 38 ++++++-
.../crypto/intel/qat/qat_common/adf_sysfs.c | 37 ++++++
22 files changed, 565 insertions(+), 13 deletions(-)
create mode 100644 drivers/crypto/intel/qat/qat_common/adf_heartbeat_inject.c

--
2.34.1



2024-01-03 04:09:47

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 3/9] crypto: qat - disable arbitration before reset

From: Zhou Furong <[email protected]>

Disable arbitration to avoid new requests to be processed before
resetting a device.

This is needed so that new requests are not fetched when an error is
detected.

Signed-off-by: Zhou Furong <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
drivers/crypto/intel/qat/qat_common/adf_aer.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index 22a43b4b8315..acbbd32bd815 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -181,8 +181,16 @@ static void adf_notify_fatal_error_worker(struct work_struct *work)
struct adf_fatal_error_data *wq_data =
container_of(work, struct adf_fatal_error_data, work);
struct adf_accel_dev *accel_dev = wq_data->accel_dev;
+ struct adf_hw_device_data *hw_device = accel_dev->hw_device;

adf_error_notifier(accel_dev);
+
+ if (!accel_dev->is_vf) {
+ /* Disable arbitration to stop processing of new requests */
+ if (hw_device->exit_arb)
+ hw_device->exit_arb(accel_dev);
+ }
+
kfree(wq_data);
}

--
2.34.1


2024-01-03 04:09:48

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 4/9] crypto: qat - update PFVF protocol for recovery

Update the PFVF logit to handle restart and recovery. This adds the
following functions:

* adf_pf2vf_notify_fatal_error(): allows the PF to notify VFs that the
device detected a fatal error and requires a reset. This sends to
VF the event `ADF_PF2VF_MSGTYPE_FATAL_ERROR`.
* adf_pf2vf_wait_for_restarting_complete(): allows the PF to wait for
`ADF_VF2PF_MSGTYPE_RESTARTING_COMPLETE` events from active VFs
before proceeding with a reset.
* adf_pf2vf_notify_restarted(): enables the PF to notify VFs with
an `ADF_PF2VF_MSGTYPE_RESTARTED` event after recovery, indicating that
the device is back to normal. This prompts VF drivers switch back to
use the accelerator for workload processing.

These changes improve the communication and synchronization between PF
and VF drivers during system restart and recovery processes.

Signed-off-by: Mun Chun Yep <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
.../intel/qat/qat_common/adf_accel_devices.h | 1 +
drivers/crypto/intel/qat/qat_common/adf_aer.c | 3 +
.../intel/qat/qat_common/adf_pfvf_msg.h | 7 +-
.../intel/qat/qat_common/adf_pfvf_pf_msg.c | 64 ++++++++++++++++++-
.../intel/qat/qat_common/adf_pfvf_pf_msg.h | 21 ++++++
.../intel/qat/qat_common/adf_pfvf_pf_proto.c | 8 +++
.../intel/qat/qat_common/adf_pfvf_vf_proto.c | 6 ++
.../crypto/intel/qat/qat_common/adf_sriov.c | 1 +
8 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
index a16c7e6edc65..4a3c36aaa7ca 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
@@ -332,6 +332,7 @@ struct adf_accel_vf_info {
struct ratelimit_state vf2pf_ratelimit;
u32 vf_nr;
bool init;
+ bool restarting;
u8 vf_compat_ver;
};

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index acbbd32bd815..ecb114e1b59f 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -7,6 +7,7 @@
#include <linux/delay.h>
#include "adf_accel_devices.h"
#include "adf_common_drv.h"
+#include "adf_pfvf_pf_msg.h"

struct adf_fatal_error_data {
struct adf_accel_dev *accel_dev;
@@ -189,6 +190,8 @@ static void adf_notify_fatal_error_worker(struct work_struct *work)
/* Disable arbitration to stop processing of new requests */
if (hw_device->exit_arb)
hw_device->exit_arb(accel_dev);
+ if (accel_dev->pf.vf_info)
+ adf_pf2vf_notify_fatal_error(accel_dev);
}

kfree(wq_data);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_pfvf_msg.h b/drivers/crypto/intel/qat/qat_common/adf_pfvf_msg.h
index 204a42438992..d1b3ef9cadac 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_pfvf_msg.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_pfvf_msg.h
@@ -99,6 +99,8 @@ enum pf2vf_msgtype {
ADF_PF2VF_MSGTYPE_RESTARTING = 0x01,
ADF_PF2VF_MSGTYPE_VERSION_RESP = 0x02,
ADF_PF2VF_MSGTYPE_BLKMSG_RESP = 0x03,
+ ADF_PF2VF_MSGTYPE_FATAL_ERROR = 0x04,
+ ADF_PF2VF_MSGTYPE_RESTARTED = 0x05,
/* Values from 0x10 are Gen4 specific, message type is only 4 bits in Gen2 devices. */
ADF_PF2VF_MSGTYPE_RP_RESET_RESP = 0x10,
};
@@ -112,6 +114,7 @@ enum vf2pf_msgtype {
ADF_VF2PF_MSGTYPE_LARGE_BLOCK_REQ = 0x07,
ADF_VF2PF_MSGTYPE_MEDIUM_BLOCK_REQ = 0x08,
ADF_VF2PF_MSGTYPE_SMALL_BLOCK_REQ = 0x09,
+ ADF_VF2PF_MSGTYPE_RESTARTING_COMPLETE = 0x0a,
/* Values from 0x10 are Gen4 specific, message type is only 4 bits in Gen2 devices. */
ADF_VF2PF_MSGTYPE_RP_RESET = 0x10,
};
@@ -124,8 +127,10 @@ enum pfvf_compatibility_version {
ADF_PFVF_COMPAT_FAST_ACK = 0x03,
/* Ring to service mapping support for non-standard mappings */
ADF_PFVF_COMPAT_RING_TO_SVC_MAP = 0x04,
+ /* Fallback compat */
+ ADF_PFVF_COMPAT_FALLBACK = 0x05,
/* Reference to the latest version */
- ADF_PFVF_COMPAT_THIS_VERSION = 0x04,
+ ADF_PFVF_COMPAT_THIS_VERSION = 0x05,
};

/* PF->VF Version Response */
diff --git a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.c b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.c
index 14c069f0d71a..0e31f4b41844 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.c
@@ -1,21 +1,83 @@
// SPDX-License-Identifier: (BSD-3-Clause OR GPL-2.0-only)
/* Copyright(c) 2015 - 2021 Intel Corporation */
+#include <linux/delay.h>
#include <linux/pci.h>
#include "adf_accel_devices.h"
#include "adf_pfvf_msg.h"
#include "adf_pfvf_pf_msg.h"
#include "adf_pfvf_pf_proto.h"

+#define ADF_PF_WAIT_RESTARTING_COMPLETE_DELAY 100
+#define ADF_VF_SHUTDOWN_RETRY 100
+
void adf_pf2vf_notify_restarting(struct adf_accel_dev *accel_dev)
{
struct adf_accel_vf_info *vf;
struct pfvf_message msg = { .type = ADF_PF2VF_MSGTYPE_RESTARTING };
int i, num_vfs = pci_num_vf(accel_to_pci_dev(accel_dev));

+ dev_dbg(&GET_DEV(accel_dev), "pf2vf notify restarting\n");
for (i = 0, vf = accel_dev->pf.vf_info; i < num_vfs; i++, vf++) {
- if (vf->init && adf_send_pf2vf_msg(accel_dev, i, msg))
+ vf->restarting = false;
+ if (!vf->init)
+ continue;
+ if (adf_send_pf2vf_msg(accel_dev, i, msg))
dev_err(&GET_DEV(accel_dev),
"Failed to send restarting msg to VF%d\n", i);
+ else if (vf->vf_compat_ver >= ADF_PFVF_COMPAT_FALLBACK)
+ vf->restarting = true;
+ }
+}
+
+void adf_pf2vf_wait_for_restarting_complete(struct adf_accel_dev *accel_dev)
+{
+ int num_vfs = pci_num_vf(accel_to_pci_dev(accel_dev));
+ int i, retries = ADF_VF_SHUTDOWN_RETRY;
+ struct adf_accel_vf_info *vf;
+ bool vf_running;
+
+ dev_dbg(&GET_DEV(accel_dev), "pf2vf wait for restarting complete\n");
+ do {
+ vf_running = false;
+ for (i = 0, vf = accel_dev->pf.vf_info; i < num_vfs; i++, vf++)
+ if (vf->restarting)
+ vf_running = true;
+ if (!vf_running)
+ break;
+ msleep(ADF_PF_WAIT_RESTARTING_COMPLETE_DELAY);
+ } while (--retries);
+
+ if (vf_running)
+ dev_warn(&GET_DEV(accel_dev), "Some VFs are still running\n");
+}
+
+void adf_pf2vf_notify_restarted(struct adf_accel_dev *accel_dev)
+{
+ struct pfvf_message msg = { .type = ADF_PF2VF_MSGTYPE_RESTARTED };
+ int i, num_vfs = pci_num_vf(accel_to_pci_dev(accel_dev));
+ struct adf_accel_vf_info *vf;
+
+ dev_dbg(&GET_DEV(accel_dev), "pf2vf notify restarted\n");
+ for (i = 0, vf = accel_dev->pf.vf_info; i < num_vfs; i++, vf++) {
+ if (vf->init && vf->vf_compat_ver >= ADF_PFVF_COMPAT_FALLBACK &&
+ adf_send_pf2vf_msg(accel_dev, i, msg))
+ dev_err(&GET_DEV(accel_dev),
+ "Failed to send restarted msg to VF%d\n", i);
+ }
+}
+
+void adf_pf2vf_notify_fatal_error(struct adf_accel_dev *accel_dev)
+{
+ struct pfvf_message msg = { .type = ADF_PF2VF_MSGTYPE_FATAL_ERROR };
+ int i, num_vfs = pci_num_vf(accel_to_pci_dev(accel_dev));
+ struct adf_accel_vf_info *vf;
+
+ dev_dbg(&GET_DEV(accel_dev), "pf2vf notify fatal error\n");
+ for (i = 0, vf = accel_dev->pf.vf_info; i < num_vfs; i++, vf++) {
+ if (vf->init && vf->vf_compat_ver >= ADF_PFVF_COMPAT_FALLBACK &&
+ adf_send_pf2vf_msg(accel_dev, i, msg))
+ dev_err(&GET_DEV(accel_dev),
+ "Failed to send fatal error msg to VF%d\n", i);
}
}

diff --git a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.h b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.h
index e8982d1ac896..f203d88c919c 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_msg.h
@@ -5,7 +5,28 @@

#include "adf_accel_devices.h"

+#if defined(CONFIG_PCI_IOV)
void adf_pf2vf_notify_restarting(struct adf_accel_dev *accel_dev);
+void adf_pf2vf_wait_for_restarting_complete(struct adf_accel_dev *accel_dev);
+void adf_pf2vf_notify_restarted(struct adf_accel_dev *accel_dev);
+void adf_pf2vf_notify_fatal_error(struct adf_accel_dev *accel_dev);
+#else
+static inline void adf_pf2vf_notify_restarting(struct adf_accel_dev *accel_dev)
+{
+}
+
+static inline void adf_pf2vf_wait_for_restarting_complete(struct adf_accel_dev *accel_dev)
+{
+}
+
+static inline void adf_pf2vf_notify_restarted(struct adf_accel_dev *accel_dev)
+{
+}
+
+static inline void adf_pf2vf_notify_fatal_error(struct adf_accel_dev *accel_dev)
+{
+}
+#endif

typedef int (*adf_pf2vf_blkmsg_provider)(struct adf_accel_dev *accel_dev,
u8 *buffer, u8 compat);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_proto.c b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_proto.c
index 388e58bcbcaf..9ab93fbfefde 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_proto.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_pfvf_pf_proto.c
@@ -291,6 +291,14 @@ static int adf_handle_vf2pf_msg(struct adf_accel_dev *accel_dev, u8 vf_nr,
vf_info->init = false;
}
break;
+ case ADF_VF2PF_MSGTYPE_RESTARTING_COMPLETE:
+ {
+ dev_dbg(&GET_DEV(accel_dev),
+ "Restarting Complete received from VF%d\n", vf_nr);
+ vf_info->restarting = false;
+ vf_info->init = false;
+ }
+ break;
case ADF_VF2PF_MSGTYPE_LARGE_BLOCK_REQ:
case ADF_VF2PF_MSGTYPE_MEDIUM_BLOCK_REQ:
case ADF_VF2PF_MSGTYPE_SMALL_BLOCK_REQ:
diff --git a/drivers/crypto/intel/qat/qat_common/adf_pfvf_vf_proto.c b/drivers/crypto/intel/qat/qat_common/adf_pfvf_vf_proto.c
index 1015155b6374..dc284a089c88 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_pfvf_vf_proto.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_pfvf_vf_proto.c
@@ -308,6 +308,12 @@ static bool adf_handle_pf2vf_msg(struct adf_accel_dev *accel_dev,

adf_pf2vf_handle_pf_restarting(accel_dev);
return false;
+ case ADF_PF2VF_MSGTYPE_RESTARTED:
+ dev_dbg(&GET_DEV(accel_dev), "Restarted message received from PF\n");
+ return true;
+ case ADF_PF2VF_MSGTYPE_FATAL_ERROR:
+ dev_err(&GET_DEV(accel_dev), "Fatal error received from PF\n");
+ return true;
case ADF_PF2VF_MSGTYPE_VERSION_RESP:
case ADF_PF2VF_MSGTYPE_BLKMSG_RESP:
case ADF_PF2VF_MSGTYPE_RP_RESET_RESP:
diff --git a/drivers/crypto/intel/qat/qat_common/adf_sriov.c b/drivers/crypto/intel/qat/qat_common/adf_sriov.c
index f44025bb6f99..cb2a9830f192 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_sriov.c
@@ -103,6 +103,7 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
return;

adf_pf2vf_notify_restarting(accel_dev);
+ adf_pf2vf_wait_for_restarting_complete(accel_dev);
pci_disable_sriov(accel_to_pci_dev(accel_dev));

/* Disable VF to PF interrupts */
--
2.34.1


2024-01-03 04:09:53

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 6/9] crypto: qat - add fatal error notification

Notify a fatal error condition and optionally reset the device in
the following cases:
* if the device reports an uncorrectable fatal error through an
interrupt
* if the heartbeat feature detects that the device is not
responding

This patch is based on earlier work done by Shashank Gupta.

Signed-off-by: Mun Chun Yep <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
drivers/crypto/intel/qat/qat_common/adf_heartbeat.c | 3 +++
drivers/crypto/intel/qat/qat_common/adf_isr.c | 7 ++++++-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
index f88b1bc6857e..fe8428d4ff39 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
@@ -229,6 +229,9 @@ void adf_heartbeat_status(struct adf_accel_dev *accel_dev,
"Heartbeat ERROR: QAT is not responding.\n");
*hb_status = HB_DEV_UNRESPONSIVE;
hb->hb_failed_counter++;
+ if (adf_notify_fatal_error(accel_dev))
+ dev_err(&GET_DEV(accel_dev),
+ "Failed to notify fatal error\n");
return;
}

diff --git a/drivers/crypto/intel/qat/qat_common/adf_isr.c b/drivers/crypto/intel/qat/qat_common/adf_isr.c
index 3557a0d6dea2..9d60fff5a76c 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_isr.c
@@ -139,8 +139,13 @@ static bool adf_handle_ras_int(struct adf_accel_dev *accel_dev)

if (ras_ops->handle_interrupt &&
ras_ops->handle_interrupt(accel_dev, &reset_required)) {
- if (reset_required)
+ if (reset_required) {
dev_err(&GET_DEV(accel_dev), "Fatal error, reset required\n");
+ if (adf_notify_fatal_error(accel_dev))
+ dev_err(&GET_DEV(accel_dev),
+ "Failed to notify fatal error\n");
+ }
+
return true;
}

--
2.34.1


2024-01-03 04:09:57

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 9/9] crypto: qat - improve aer error reset handling

Disable worker threads and clear pci bus master in aer error handler.

This is to avoid new requests to be processed and allow dummy responses
generation for in-flight requests in user space before resetting a device.

Signed-off-by: Mun Chun Yep <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
drivers/crypto/intel/qat/qat_common/adf_aer.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index b3d4b6b99c65..df0ff0caf419 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -28,6 +28,14 @@ static pci_ers_result_t adf_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_DISCONNECT;
}

+ if (accel_dev->hw_device->exit_arb) {
+ dev_info(&pdev->dev, "Disabling arbitration\n");
+ accel_dev->hw_device->exit_arb(accel_dev);
+ }
+ adf_error_notifier(accel_dev);
+ adf_pf2vf_notify_fatal_error(accel_dev);
+ pci_clear_master(pdev);
+
if (state == pci_channel_io_perm_failure) {
dev_err(&pdev->dev, "Can't recover from device error\n");
return PCI_ERS_RESULT_DISCONNECT;
@@ -111,9 +119,18 @@ static void adf_device_reset_worker(struct work_struct *work)
container_of(work, struct adf_reset_dev_data, reset_work);
struct adf_accel_dev *accel_dev = reset_data->accel_dev;
unsigned long wait_jiffies = msecs_to_jiffies(10000);
+ struct pci_dev *pdev = accel_to_pci_dev(accel_dev);
struct adf_sriov_dev_data sriov_data;

adf_dev_restarting_notify(accel_dev);
+
+ /*
+ * re-enable device to support pf/vf comms as it would be disabled
+ * in the detect function of aer driver
+ */
+ if (!pdev->is_busmaster)
+ pci_set_master(pdev);
+
if (adf_dev_restart(accel_dev)) {
/* The device hanged and we can't restart it so stop here */
dev_err(&GET_DEV(accel_dev), "Restart device failed\n");
--
2.34.1


2024-01-03 04:09:59

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 5/9] crypto: qat - re-enable sriov after pf reset

When a Physical Function (PF) is reset, SR-IOV gets disabled, making the
associated Virtual Functions (VFs) unavailable. Even after reset and
using pci_restore_state, VFs remain uncreated because the numvfs still
at 0. Therefore, it's necessary to reconfigure SR-IOV to re-enable VFs.

This commit introduces the ADF_SRIOV_ENABLED configuration flag to cache
the SR-IOV enablement state. SR-IOV is only re-enabled if it was
previously configured.

This commit also introduces a dedicated workqueue without
`WQ_MEM_RECLAIM` flag for enabling SR-IOV during Heartbeat and CPM error
resets, preventing workqueue flushing warning.

This patch is based on earlier work done by Shashank Gupta.

Signed-off-by: Mun Chun Yep <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
drivers/crypto/intel/qat/qat_common/adf_aer.c | 40 ++++++++++++++++++-
.../intel/qat/qat_common/adf_cfg_strings.h | 1 +
.../intel/qat/qat_common/adf_common_drv.h | 5 +++
.../crypto/intel/qat/qat_common/adf_sriov.c | 37 +++++++++++++++--
4 files changed, 79 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index ecb114e1b59f..cd273b31db0e 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -15,6 +15,7 @@ struct adf_fatal_error_data {
};

static struct workqueue_struct *device_reset_wq;
+static struct workqueue_struct *device_sriov_wq;

static pci_ers_result_t adf_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
@@ -43,6 +44,13 @@ struct adf_reset_dev_data {
struct work_struct reset_work;
};

+/* sriov dev data */
+struct adf_sriov_dev_data {
+ struct adf_accel_dev *accel_dev;
+ struct completion compl;
+ struct work_struct sriov_work;
+};
+
void adf_reset_sbr(struct adf_accel_dev *accel_dev)
{
struct pci_dev *pdev = accel_to_pci_dev(accel_dev);
@@ -88,11 +96,22 @@ void adf_dev_restore(struct adf_accel_dev *accel_dev)
}
}

+static void adf_device_sriov_worker(struct work_struct *work)
+{
+ struct adf_sriov_dev_data *sriov_data =
+ container_of(work, struct adf_sriov_dev_data, sriov_work);
+
+ adf_reenable_sriov(sriov_data->accel_dev);
+ complete(&sriov_data->compl);
+}
+
static void adf_device_reset_worker(struct work_struct *work)
{
struct adf_reset_dev_data *reset_data =
container_of(work, struct adf_reset_dev_data, reset_work);
struct adf_accel_dev *accel_dev = reset_data->accel_dev;
+ unsigned long wait_jiffies = msecs_to_jiffies(10000);
+ struct adf_sriov_dev_data sriov_data;

adf_dev_restarting_notify(accel_dev);
if (adf_dev_restart(accel_dev)) {
@@ -103,6 +122,14 @@ static void adf_device_reset_worker(struct work_struct *work)
WARN(1, "QAT: device restart failed. Device is unusable\n");
return;
}
+
+ sriov_data.accel_dev = accel_dev;
+ init_completion(&sriov_data.compl);
+ INIT_WORK(&sriov_data.sriov_work, adf_device_sriov_worker);
+ queue_work(device_sriov_wq, &sriov_data.sriov_work);
+ if (wait_for_completion_timeout(&sriov_data.compl, wait_jiffies))
+ adf_pf2vf_notify_restarted(accel_dev);
+
adf_dev_restarted_notify(accel_dev);
clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);

@@ -216,7 +243,14 @@ int adf_init_aer(void)
{
device_reset_wq = alloc_workqueue("qat_device_reset_wq",
WQ_MEM_RECLAIM, 0);
- return !device_reset_wq ? -EFAULT : 0;
+ if (!device_reset_wq)
+ return -EFAULT;
+
+ device_sriov_wq = alloc_workqueue("qat_device_sriov_wq", 0, 0);
+ if (!device_sriov_wq)
+ return -EFAULT;
+
+ return 0;
}

void adf_exit_aer(void)
@@ -224,4 +258,8 @@ void adf_exit_aer(void)
if (device_reset_wq)
destroy_workqueue(device_reset_wq);
device_reset_wq = NULL;
+
+ if (device_sriov_wq)
+ destroy_workqueue(device_sriov_wq);
+ device_sriov_wq = NULL;
}
diff --git a/drivers/crypto/intel/qat/qat_common/adf_cfg_strings.h b/drivers/crypto/intel/qat/qat_common/adf_cfg_strings.h
index 322b76903a73..e015ad6cace2 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_cfg_strings.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_cfg_strings.h
@@ -49,5 +49,6 @@
ADF_ETRMGR_BANK "%d" ADF_ETRMGR_CORE_AFFINITY
#define ADF_ACCEL_STR "Accelerator%d"
#define ADF_HEARTBEAT_TIMER "HeartbeatTimer"
+#define ADF_SRIOV_ENABLED "SriovEnabled"

#endif
diff --git a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
index 8c062d5a8db2..10891c9da6e7 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
@@ -192,6 +192,7 @@ bool adf_misc_wq_queue_delayed_work(struct delayed_work *work,
#if defined(CONFIG_PCI_IOV)
int adf_sriov_configure(struct pci_dev *pdev, int numvfs);
void adf_disable_sriov(struct adf_accel_dev *accel_dev);
+void adf_reenable_sriov(struct adf_accel_dev *accel_dev);
void adf_enable_vf2pf_interrupts(struct adf_accel_dev *accel_dev, u32 vf_mask);
void adf_disable_all_vf2pf_interrupts(struct adf_accel_dev *accel_dev);
bool adf_recv_and_handle_pf2vf_msg(struct adf_accel_dev *accel_dev);
@@ -212,6 +213,10 @@ static inline void adf_disable_sriov(struct adf_accel_dev *accel_dev)
{
}

+static inline void adf_reenable_sriov(struct adf_accel_dev *accel_dev)
+{
+}
+
static inline int adf_init_pf_wq(void)
{
return 0;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_sriov.c b/drivers/crypto/intel/qat/qat_common/adf_sriov.c
index cb2a9830f192..1a9523994c0b 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_sriov.c
@@ -60,7 +60,6 @@ static int adf_enable_sriov(struct adf_accel_dev *accel_dev)
/* This ptr will be populated when VFs will be created */
vf_info->accel_dev = accel_dev;
vf_info->vf_nr = i;
- vf_info->vf_compat_ver = 0;

mutex_init(&vf_info->pf2vf_lock);
ratelimit_state_init(&vf_info->vf2pf_ratelimit,
@@ -84,6 +83,32 @@ static int adf_enable_sriov(struct adf_accel_dev *accel_dev)
return pci_enable_sriov(pdev, totalvfs);
}

+void adf_reenable_sriov(struct adf_accel_dev *accel_dev)
+{
+ struct pci_dev *pdev = accel_to_pci_dev(accel_dev);
+ char cfg[ADF_CFG_MAX_VAL_LEN_IN_BYTES] = {0};
+ unsigned long val = 0;
+
+ if (adf_cfg_get_param_value(accel_dev, ADF_GENERAL_SEC,
+ ADF_SRIOV_ENABLED, cfg))
+ return;
+
+ if (!accel_dev->pf.vf_info)
+ return;
+
+ if (adf_cfg_add_key_value_param(accel_dev, ADF_KERNEL_SEC, ADF_NUM_CY,
+ &val, ADF_DEC))
+ return;
+
+ if (adf_cfg_add_key_value_param(accel_dev, ADF_KERNEL_SEC, ADF_NUM_DC,
+ &val, ADF_DEC))
+ return;
+
+ set_bit(ADF_STATUS_CONFIGURED, &accel_dev->status);
+ dev_info(&pdev->dev, "Re-enabling SRIOV\n");
+ adf_enable_sriov(accel_dev);
+}
+
/**
* adf_disable_sriov() - Disable SRIOV for the device
* @accel_dev: Pointer to accel device.
@@ -116,8 +141,10 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++)
mutex_destroy(&vf->pf2vf_lock);

- kfree(accel_dev->pf.vf_info);
- accel_dev->pf.vf_info = NULL;
+ if (!test_bit(ADF_STATUS_RESTARTING, &accel_dev->status)) {
+ kfree(accel_dev->pf.vf_info);
+ accel_dev->pf.vf_info = NULL;
+ }
}
EXPORT_SYMBOL_GPL(adf_disable_sriov);

@@ -195,6 +222,10 @@ int adf_sriov_configure(struct pci_dev *pdev, int numvfs)
if (ret)
return ret;

+ val = 1;
+ adf_cfg_add_key_value_param(accel_dev, ADF_GENERAL_SEC, ADF_SRIOV_ENABLED,
+ &val, ADF_DEC);
+
return numvfs;
}
EXPORT_SYMBOL_GPL(adf_sriov_configure);
--
2.34.1


2024-01-03 04:10:01

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 7/9] crypto: qat - add auto reset on error

From: Damian Muszynski <[email protected]>

Expose the `auto_reset` sysfs attribute to configure the driver to reset
the device when a fatal error is detected.

When auto reset is enabled, the driver resets the device when it detects
either an heartbeat failure or a fatal error through an interrupt.

This patch is based on earlier work done by Shashank Gupta.

Signed-off-by: Damian Muszynski <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
Documentation/ABI/testing/sysfs-driver-qat | 20 ++++++++++
.../intel/qat/qat_common/adf_accel_devices.h | 1 +
drivers/crypto/intel/qat/qat_common/adf_aer.c | 11 +++++-
.../intel/qat/qat_common/adf_common_drv.h | 1 +
.../crypto/intel/qat/qat_common/adf_sysfs.c | 37 +++++++++++++++++++
5 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-driver-qat b/Documentation/ABI/testing/sysfs-driver-qat
index bbf329cf0d67..6778f1fea874 100644
--- a/Documentation/ABI/testing/sysfs-driver-qat
+++ b/Documentation/ABI/testing/sysfs-driver-qat
@@ -141,3 +141,23 @@ Description:
64

This attribute is only available for qat_4xxx devices.
+
+What: /sys/bus/pci/devices/<BDF>/qat/auto_reset
+Date: March 2024
+KernelVersion: 6.8
+Contact: [email protected]
+Description: (RW) Reports the current state of the autoreset feature
+ for a QAT device
+
+ Write to the attribute to enable or disable device auto reset.
+
+ Device auto reset is disabled by default.
+
+ The values are::
+
+ * 1/Yy/on: auto reset enabled. If the device encounters an
+ unrecoverable error, it will be reset automatically.
+ * 0/Nn/off: auto reset disabled. If the device encounters an
+ unrecoverable error, it will not be reset.
+
+ This attribute is only available for qat_4xxx devices.
diff --git a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
index 4a3c36aaa7ca..0f26aa976c8c 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
@@ -402,6 +402,7 @@ struct adf_accel_dev {
struct adf_error_counters ras_errors;
struct mutex state_lock; /* protect state of the device */
bool is_vf;
+ bool autoreset_on_error;
u32 accel_id;
};
#endif
diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index cd273b31db0e..b3d4b6b99c65 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -204,6 +204,14 @@ const struct pci_error_handlers adf_err_handler = {
};
EXPORT_SYMBOL_GPL(adf_err_handler);

+int adf_dev_autoreset(struct adf_accel_dev *accel_dev)
+{
+ if (accel_dev->autoreset_on_error)
+ return adf_dev_aer_schedule_reset(accel_dev, ADF_DEV_RESET_ASYNC);
+
+ return 0;
+}
+
static void adf_notify_fatal_error_worker(struct work_struct *work)
{
struct adf_fatal_error_data *wq_data =
@@ -215,10 +223,11 @@ static void adf_notify_fatal_error_worker(struct work_struct *work)

if (!accel_dev->is_vf) {
/* Disable arbitration to stop processing of new requests */
- if (hw_device->exit_arb)
+ if (accel_dev->autoreset_on_error && hw_device->exit_arb)
hw_device->exit_arb(accel_dev);
if (accel_dev->pf.vf_info)
adf_pf2vf_notify_fatal_error(accel_dev);
+ adf_dev_autoreset(accel_dev);
}

kfree(wq_data);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
index 10891c9da6e7..57328249c89e 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
@@ -87,6 +87,7 @@ int adf_ae_stop(struct adf_accel_dev *accel_dev);
extern const struct pci_error_handlers adf_err_handler;
void adf_reset_sbr(struct adf_accel_dev *accel_dev);
void adf_reset_flr(struct adf_accel_dev *accel_dev);
+int adf_dev_autoreset(struct adf_accel_dev *accel_dev);
void adf_dev_restore(struct adf_accel_dev *accel_dev);
int adf_init_aer(void);
void adf_exit_aer(void);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_sysfs.c b/drivers/crypto/intel/qat/qat_common/adf_sysfs.c
index d450dad32c9e..4e7f70d4049d 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_sysfs.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_sysfs.c
@@ -204,6 +204,42 @@ static ssize_t pm_idle_enabled_store(struct device *dev, struct device_attribute
}
static DEVICE_ATTR_RW(pm_idle_enabled);

+static ssize_t auto_reset_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ char *auto_reset;
+ struct adf_accel_dev *accel_dev;
+
+ accel_dev = adf_devmgr_pci_to_accel_dev(to_pci_dev(dev));
+ if (!accel_dev)
+ return -EINVAL;
+
+ auto_reset = accel_dev->autoreset_on_error ? "on" : "off";
+
+ return sysfs_emit(buf, "%s\n", auto_reset);
+}
+
+static ssize_t auto_reset_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct adf_accel_dev *accel_dev;
+ bool enabled = false;
+ int ret;
+
+ ret = kstrtobool(buf, &enabled);
+ if (ret)
+ return ret;
+
+ accel_dev = adf_devmgr_pci_to_accel_dev(to_pci_dev(dev));
+ if (!accel_dev)
+ return -EINVAL;
+
+ accel_dev->autoreset_on_error = enabled;
+
+ return count;
+}
+static DEVICE_ATTR_RW(auto_reset);
+
static DEVICE_ATTR_RW(state);
static DEVICE_ATTR_RW(cfg_services);

@@ -291,6 +327,7 @@ static struct attribute *qat_attrs[] = {
&dev_attr_pm_idle_enabled.attr,
&dev_attr_rp2srv.attr,
&dev_attr_num_rps.attr,
+ &dev_attr_auto_reset.attr,
NULL,
};

--
2.34.1


2024-01-03 04:10:02

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 8/9] crypto: qat - limit heartbeat notifications

From: Furong Zhou <[email protected]>

When the driver detects an heartbeat failure, it starts the recovery
flow. Set a limit so that the number of events is limited in case the
heartbeat status is read too frequently.

Signed-off-by: Furong Zhou <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
.../crypto/intel/qat/qat_common/adf_heartbeat.c | 17 ++++++++++++++---
.../crypto/intel/qat/qat_common/adf_heartbeat.h | 3 +++
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
index fe8428d4ff39..ef23cd817818 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
@@ -205,6 +205,19 @@ static int adf_hb_get_status(struct adf_accel_dev *accel_dev)
return ret;
}

+static void adf_heartbeat_reset(struct adf_accel_dev *accel_dev)
+{
+ u64 curr_time = adf_clock_get_current_time();
+ u64 last_reset = curr_time - accel_dev->heartbeat->last_hb_reset_time;
+
+ if (last_reset < ADF_CFG_HB_RESET_MS)
+ return;
+
+ accel_dev->heartbeat->last_hb_reset_time = curr_time;
+ if (adf_notify_fatal_error(accel_dev))
+ dev_err(&GET_DEV(accel_dev), "Failed to notify fatal error\n");
+}
+
void adf_heartbeat_status(struct adf_accel_dev *accel_dev,
enum adf_device_heartbeat_status *hb_status)
{
@@ -229,9 +242,7 @@ void adf_heartbeat_status(struct adf_accel_dev *accel_dev,
"Heartbeat ERROR: QAT is not responding.\n");
*hb_status = HB_DEV_UNRESPONSIVE;
hb->hb_failed_counter++;
- if (adf_notify_fatal_error(accel_dev))
- dev_err(&GET_DEV(accel_dev),
- "Failed to notify fatal error\n");
+ adf_heartbeat_reset(accel_dev);
return;
}

diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
index 75db563f23ba..900876baac66 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
@@ -13,6 +13,8 @@ struct dentry;
#define ADF_CFG_HB_TIMER_DEFAULT_MS 500
#define ADF_CFG_HB_COUNT_THRESHOLD 3

+#define ADF_CFG_HB_RESET_MS 5000
+
enum adf_device_heartbeat_status {
HB_DEV_UNRESPONSIVE = 0,
HB_DEV_ALIVE,
@@ -30,6 +32,7 @@ struct adf_heartbeat {
unsigned int hb_failed_counter;
unsigned int hb_timer;
u64 last_hb_check_time;
+ u64 last_hb_reset_time;
bool ctrs_cnt_checked;
struct hb_dma_addr {
dma_addr_t phy_addr;
--
2.34.1


2024-01-03 04:10:12

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 1/9] crypto: qat - add heartbeat error simulator

From: Damian Muszynski <[email protected]>

Add a mechanism that allows to inject a heartbeat error for testing
purposes.
A new attribute `inject_error` is added to debugfs for each QAT device.
Upon a write on this attribute, the driver will inject an error on the
device which can then be detected by the heartbeat feature.
Errors are breaking the device functionality thus they require a
device reset in order to be recovered.

This functionality is not compiled by default, to enable it
CRYPTO_DEV_QAT_ERROR_INJECTION must be set.

Signed-off-by: Damian Muszynski <[email protected]>
Reviewed-by: Giovanni Cabiddu <[email protected]>
Reviewed-by: Lucas Segarra Fernandez <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
Documentation/ABI/testing/debugfs-driver-qat | 26 +++++++
drivers/crypto/intel/qat/Kconfig | 15 ++++
drivers/crypto/intel/qat/qat_common/Makefile | 3 +
.../intel/qat/qat_common/adf_common_drv.h | 1 +
.../intel/qat/qat_common/adf_heartbeat.c | 6 --
.../intel/qat/qat_common/adf_heartbeat.h | 18 +++++
.../qat/qat_common/adf_heartbeat_dbgfs.c | 52 +++++++++++++
.../qat/qat_common/adf_heartbeat_inject.c | 76 +++++++++++++++++++
.../intel/qat/qat_common/adf_hw_arbiter.c | 25 ++++++
9 files changed, 216 insertions(+), 6 deletions(-)
create mode 100644 drivers/crypto/intel/qat/qat_common/adf_heartbeat_inject.c

diff --git a/Documentation/ABI/testing/debugfs-driver-qat b/Documentation/ABI/testing/debugfs-driver-qat
index b2db010d851e..bd6793760f29 100644
--- a/Documentation/ABI/testing/debugfs-driver-qat
+++ b/Documentation/ABI/testing/debugfs-driver-qat
@@ -81,3 +81,29 @@ Description: (RO) Read returns, for each Acceleration Engine (AE), the number
<N>: Number of Compress and Verify (CnV) errors and type
of the last CnV error detected by Acceleration
Engine N.
+
+What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/inject_error
+Date: March 2024
+KernelVersion: 6.8
+Contact: [email protected]
+Description: (WO) Write to inject an error that simulates an heartbeat
+ failure. This is to be used for testing purposes.
+
+ After writing this file, the driver stops arbitration on a
+ random engine and disables the fetching of heartbeat counters.
+ If a workload is running on the device, a job submitted to the
+ accelerator might not get a response and a read of the
+ `heartbeat/status` attribute might report -1, i.e. device
+ unresponsive.
+ The error is unrecoverable thus the device must be restarted to
+ restore its functionality.
+
+ This attribute is available only when the kernel is built with
+ CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION=y.
+
+ A write of 1 enables error injection.
+
+ The following example shows how to enable error injection::
+
+ # cd /sys/kernel/debug/qat_<device>_<BDF>
+ # echo 1 > heartbeat/inject_error
diff --git a/drivers/crypto/intel/qat/Kconfig b/drivers/crypto/intel/qat/Kconfig
index c120f6715a09..7aefd57f9d2f 100644
--- a/drivers/crypto/intel/qat/Kconfig
+++ b/drivers/crypto/intel/qat/Kconfig
@@ -106,3 +106,18 @@ config CRYPTO_DEV_QAT_C62XVF

To compile this as a module, choose M here: the module
will be called qat_c62xvf.
+
+config CRYPTO_DEV_QAT_ERROR_INJECTION
+ bool "Support for Intel(R) QAT Devices Heartbeat Error Injection"
+ default n
+ depends on CRYPTO_DEV_QAT
+ depends on DEBUG_FS
+ help
+ Enables a mechanism that allows to inject a heartbeat error on
+ Intel(R) QuickAssist devices for testing purposes.
+
+ This is intended for developer use only.
+ If unsure, say N.
+
+ This functionality is available via debugfs entry of the Intel(R)
+ QuickAssist device
diff --git a/drivers/crypto/intel/qat/qat_common/Makefile b/drivers/crypto/intel/qat/qat_common/Makefile
index 6908727bff3b..e36310667335 100644
--- a/drivers/crypto/intel/qat/qat_common/Makefile
+++ b/drivers/crypto/intel/qat/qat_common/Makefile
@@ -53,3 +53,6 @@ intel_qat-$(CONFIG_PCI_IOV) += adf_sriov.o adf_vf_isr.o adf_pfvf_utils.o \
adf_pfvf_pf_msg.o adf_pfvf_pf_proto.o \
adf_pfvf_vf_msg.o adf_pfvf_vf_proto.o \
adf_gen2_pfvf.o adf_gen4_pfvf.o
+
+intel_qat-$(CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION) += adf_heartbeat_inject.o
+ccflags-$(CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION) += -DQAT_HB_ERROR_INJECTION
diff --git a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
index f06188033a93..0baae42deb3a 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
@@ -90,6 +90,7 @@ void adf_exit_aer(void);
int adf_init_arb(struct adf_accel_dev *accel_dev);
void adf_exit_arb(struct adf_accel_dev *accel_dev);
void adf_update_ring_arb(struct adf_etr_ring_data *ring);
+int adf_disable_arb_thd(struct adf_accel_dev *accel_dev, u32 ae, u32 thr);

int adf_dev_get(struct adf_accel_dev *accel_dev);
void adf_dev_put(struct adf_accel_dev *accel_dev);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
index 13f48d2f6da8..f88b1bc6857e 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.c
@@ -23,12 +23,6 @@

#define ADF_HB_EMPTY_SIG 0xA5A5A5A5

-/* Heartbeat counter pair */
-struct hb_cnt_pair {
- __u16 resp_heartbeat_cnt;
- __u16 req_heartbeat_cnt;
-};
-
static int adf_hb_check_polling_freq(struct adf_accel_dev *accel_dev)
{
u64 curr_time = adf_clock_get_current_time();
diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
index b22e3cb29798..75db563f23ba 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat.h
@@ -19,6 +19,12 @@ enum adf_device_heartbeat_status {
HB_DEV_UNSUPPORTED,
};

+/* Heartbeat counter pair */
+struct hb_cnt_pair {
+ __u16 resp_heartbeat_cnt;
+ __u16 req_heartbeat_cnt;
+};
+
struct adf_heartbeat {
unsigned int hb_sent_counter;
unsigned int hb_failed_counter;
@@ -35,6 +41,9 @@ struct adf_heartbeat {
struct dentry *cfg;
struct dentry *sent;
struct dentry *failed;
+#ifdef QAT_HB_ERROR_INJECTION
+ struct dentry *inject_error;
+#endif
} dbgfs;
};

@@ -51,6 +60,15 @@ void adf_heartbeat_status(struct adf_accel_dev *accel_dev,
enum adf_device_heartbeat_status *hb_status);
void adf_heartbeat_check_ctrs(struct adf_accel_dev *accel_dev);

+#ifdef QAT_HB_ERROR_INJECTION
+int adf_heartbeat_inject_error(struct adf_accel_dev *accel_dev);
+#else
+static inline int adf_heartbeat_inject_error(struct adf_accel_dev *accel_dev)
+{
+ return -EPERM;
+}
+#endif
+
#else
static inline int adf_heartbeat_init(struct adf_accel_dev *accel_dev)
{
diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat_dbgfs.c b/drivers/crypto/intel/qat/qat_common/adf_heartbeat_dbgfs.c
index 2661af6a2ef6..d797677397e6 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_heartbeat_dbgfs.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat_dbgfs.c
@@ -155,6 +155,43 @@ static const struct file_operations adf_hb_cfg_fops = {
.write = adf_hb_cfg_write,
};

+static ssize_t adf_hb_error_inject_write(struct file *file,
+ const char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ struct adf_accel_dev *accel_dev = file->private_data;
+ size_t written_chars;
+ char buf[3];
+ int ret;
+
+ /* last byte left as string termination */
+ if (count != 2)
+ return -EINVAL;
+
+ written_chars = simple_write_to_buffer(buf, sizeof(buf) - 1,
+ ppos, user_buf, count);
+ if (buf[0] != '1')
+ return -EINVAL;
+
+ ret = adf_heartbeat_inject_error(accel_dev);
+ if (ret) {
+ dev_err(&GET_DEV(accel_dev),
+ "Heartbeat error injection failed with status %d\n",
+ ret);
+ return ret;
+ }
+
+ dev_info(&GET_DEV(accel_dev), "Heartbeat error injection enabled\n");
+
+ return written_chars;
+}
+
+static const struct file_operations adf_hb_error_inject_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .write = adf_hb_error_inject_write,
+};
+
void adf_heartbeat_dbgfs_add(struct adf_accel_dev *accel_dev)
{
struct adf_heartbeat *hb = accel_dev->heartbeat;
@@ -171,6 +208,17 @@ void adf_heartbeat_dbgfs_add(struct adf_accel_dev *accel_dev)
&hb->hb_failed_counter, &adf_hb_stats_fops);
hb->dbgfs.cfg = debugfs_create_file("config", 0600, hb->dbgfs.base_dir,
accel_dev, &adf_hb_cfg_fops);
+
+ if (IS_ENABLED(CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION)) {
+ struct dentry *inject_error __maybe_unused;
+
+ inject_error = debugfs_create_file("inject_error", 0200,
+ hb->dbgfs.base_dir, accel_dev,
+ &adf_hb_error_inject_fops);
+#ifdef QAT_HB_ERROR_INJECTION
+ hb->dbgfs.inject_error = inject_error;
+#endif
+ }
}
EXPORT_SYMBOL_GPL(adf_heartbeat_dbgfs_add);

@@ -189,6 +237,10 @@ void adf_heartbeat_dbgfs_rm(struct adf_accel_dev *accel_dev)
hb->dbgfs.failed = NULL;
debugfs_remove(hb->dbgfs.cfg);
hb->dbgfs.cfg = NULL;
+#ifdef QAT_HB_ERROR_INJECTION
+ debugfs_remove(hb->dbgfs.inject_error);
+ hb->dbgfs.inject_error = NULL;
+#endif
debugfs_remove(hb->dbgfs.base_dir);
hb->dbgfs.base_dir = NULL;
}
diff --git a/drivers/crypto/intel/qat/qat_common/adf_heartbeat_inject.c b/drivers/crypto/intel/qat/qat_common/adf_heartbeat_inject.c
new file mode 100644
index 000000000000..a3b474bdef6c
--- /dev/null
+++ b/drivers/crypto/intel/qat/qat_common/adf_heartbeat_inject.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2023 Intel Corporation */
+#include <linux/random.h>
+
+#include "adf_admin.h"
+#include "adf_common_drv.h"
+#include "adf_heartbeat.h"
+
+#define MAX_HB_TICKS 0xFFFFFFFF
+
+static int adf_hb_set_timer_to_max(struct adf_accel_dev *accel_dev)
+{
+ struct adf_hw_device_data *hw_data = accel_dev->hw_device;
+
+ accel_dev->heartbeat->hb_timer = 0;
+
+ if (hw_data->stop_timer)
+ hw_data->stop_timer(accel_dev);
+
+ return adf_send_admin_hb_timer(accel_dev, MAX_HB_TICKS);
+}
+
+static void adf_set_hb_counters_fail(struct adf_accel_dev *accel_dev, u32 ae,
+ u32 thr)
+{
+ struct hb_cnt_pair *stats = accel_dev->heartbeat->dma.virt_addr;
+ struct adf_hw_device_data *hw_device = accel_dev->hw_device;
+ const size_t max_aes = hw_device->get_num_aes(hw_device);
+ const size_t hb_ctrs = hw_device->num_hb_ctrs;
+ size_t thr_id = ae * hb_ctrs + thr;
+ u16 num_rsp = stats[thr_id].resp_heartbeat_cnt;
+
+ /*
+ * Inject live.req != live.rsp and live.rsp == last.rsp
+ * to trigger the heartbeat error detection
+ */
+ stats[thr_id].req_heartbeat_cnt++;
+ stats += (max_aes * hb_ctrs);
+ stats[thr_id].resp_heartbeat_cnt = num_rsp;
+}
+
+int adf_heartbeat_inject_error(struct adf_accel_dev *accel_dev)
+{
+ struct adf_hw_device_data *hw_device = accel_dev->hw_device;
+ const size_t max_aes = hw_device->get_num_aes(hw_device);
+ const size_t hb_ctrs = hw_device->num_hb_ctrs;
+ u32 rand, rand_ae, rand_thr;
+ unsigned long ae_mask;
+ int ret;
+
+ ae_mask = hw_device->ae_mask;
+
+ do {
+ /* Ensure we have a valid ae */
+ get_random_bytes(&rand, sizeof(rand));
+ rand_ae = rand % max_aes;
+ } while (!test_bit(rand_ae, &ae_mask));
+
+ get_random_bytes(&rand, sizeof(rand));
+ rand_thr = rand % hb_ctrs;
+
+ /* Increase the heartbeat timer to prevent FW updating HB counters */
+ ret = adf_hb_set_timer_to_max(accel_dev);
+ if (ret)
+ return ret;
+
+ /* Configure worker threads to stop processing any packet */
+ ret = adf_disable_arb_thd(accel_dev, rand_ae, rand_thr);
+ if (ret)
+ return ret;
+
+ /* Change HB counters memory to simulate a hang */
+ adf_set_hb_counters_fail(accel_dev, rand_ae, rand_thr);
+
+ return 0;
+}
diff --git a/drivers/crypto/intel/qat/qat_common/adf_hw_arbiter.c b/drivers/crypto/intel/qat/qat_common/adf_hw_arbiter.c
index da6956699246..65bd26b25abc 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_hw_arbiter.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_hw_arbiter.c
@@ -103,3 +103,28 @@ void adf_exit_arb(struct adf_accel_dev *accel_dev)
csr_ops->write_csr_ring_srv_arb_en(csr, i, 0);
}
EXPORT_SYMBOL_GPL(adf_exit_arb);
+
+int adf_disable_arb_thd(struct adf_accel_dev *accel_dev, u32 ae, u32 thr)
+{
+ void __iomem *csr = accel_dev->transport->banks[0].csr_addr;
+ struct adf_hw_device_data *hw_data = accel_dev->hw_device;
+ const u32 *thd_2_arb_cfg;
+ struct arb_info info;
+ u32 ae_thr_map;
+
+ if (ADF_AE_STRAND0_THREAD == thr || ADF_AE_STRAND1_THREAD == thr)
+ thr = ADF_AE_ADMIN_THREAD;
+
+ hw_data->get_arb_info(&info);
+ thd_2_arb_cfg = hw_data->get_arb_mapping(accel_dev);
+ if (!thd_2_arb_cfg)
+ return -EFAULT;
+
+ /* Disable scheduling for this particular AE and thread */
+ ae_thr_map = *(thd_2_arb_cfg + ae);
+ ae_thr_map &= ~(GENMASK(3, 0) << (thr * BIT(2)));
+
+ WRITE_CSR_ARB_WT2SAM(csr, info.arb_offset, info.wt2sam_offset, ae,
+ ae_thr_map);
+ return 0;
+}
--
2.34.1


2024-01-03 04:10:14

by Mun Chun Yep

[permalink] [raw]
Subject: [PATCH 2/9] crypto: qat - add fatal error notify method

From: Zhou Furong <[email protected]>

Add error notify method to report a fatal error event to all the
subsystems registered. In addition expose an API,
adf_notify_fatal_error(), that allows to trigger a fatal error
notification asynchronously in the context of a workqueue.

This will be invoked when a fatal error is detected by the ISR or
through Heartbeat.

Signed-off-by: Zhou Furong <[email protected]>
Reviewed-by: Ahsan Atta <[email protected]>
Reviewed-by: Markas Rapoportas <[email protected]>
---
drivers/crypto/intel/qat/qat_common/adf_aer.c | 30 +++++++++++++++++++
.../intel/qat/qat_common/adf_common_drv.h | 3 ++
.../crypto/intel/qat/qat_common/adf_init.c | 12 ++++++++
3 files changed, 45 insertions(+)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index a39e70bd4b21..22a43b4b8315 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -8,6 +8,11 @@
#include "adf_accel_devices.h"
#include "adf_common_drv.h"

+struct adf_fatal_error_data {
+ struct adf_accel_dev *accel_dev;
+ struct work_struct work;
+};
+
static struct workqueue_struct *device_reset_wq;

static pci_ers_result_t adf_error_detected(struct pci_dev *pdev,
@@ -171,6 +176,31 @@ const struct pci_error_handlers adf_err_handler = {
};
EXPORT_SYMBOL_GPL(adf_err_handler);

+static void adf_notify_fatal_error_worker(struct work_struct *work)
+{
+ struct adf_fatal_error_data *wq_data =
+ container_of(work, struct adf_fatal_error_data, work);
+ struct adf_accel_dev *accel_dev = wq_data->accel_dev;
+
+ adf_error_notifier(accel_dev);
+ kfree(wq_data);
+}
+
+int adf_notify_fatal_error(struct adf_accel_dev *accel_dev)
+{
+ struct adf_fatal_error_data *wq_data;
+
+ wq_data = kzalloc(sizeof(*wq_data), GFP_ATOMIC);
+ if (!wq_data)
+ return -ENOMEM;
+
+ wq_data->accel_dev = accel_dev;
+ INIT_WORK(&wq_data->work, adf_notify_fatal_error_worker);
+ adf_misc_wq_queue_work(&wq_data->work);
+
+ return 0;
+}
+
int adf_init_aer(void)
{
device_reset_wq = alloc_workqueue("qat_device_reset_wq",
diff --git a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
index 0baae42deb3a..8c062d5a8db2 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
@@ -40,6 +40,7 @@ enum adf_event {
ADF_EVENT_SHUTDOWN,
ADF_EVENT_RESTARTING,
ADF_EVENT_RESTARTED,
+ ADF_EVENT_FATAL_ERROR,
};

struct service_hndl {
@@ -60,6 +61,8 @@ int adf_dev_restart(struct adf_accel_dev *accel_dev);

void adf_devmgr_update_class_index(struct adf_hw_device_data *hw_data);
void adf_clean_vf_map(bool);
+int adf_notify_fatal_error(struct adf_accel_dev *accel_dev);
+void adf_error_notifier(struct adf_accel_dev *accel_dev);
int adf_devmgr_add_dev(struct adf_accel_dev *accel_dev,
struct adf_accel_dev *pf);
void adf_devmgr_rm_dev(struct adf_accel_dev *accel_dev,
diff --git a/drivers/crypto/intel/qat/qat_common/adf_init.c b/drivers/crypto/intel/qat/qat_common/adf_init.c
index f43ae9111553..74f0818c0703 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_init.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_init.c
@@ -433,6 +433,18 @@ int adf_dev_restarted_notify(struct adf_accel_dev *accel_dev)
return 0;
}

+void adf_error_notifier(struct adf_accel_dev *accel_dev)
+{
+ struct service_hndl *service;
+
+ list_for_each_entry(service, &service_table, list) {
+ if (service->event_hld(accel_dev, ADF_EVENT_FATAL_ERROR))
+ dev_err(&GET_DEV(accel_dev),
+ "Failed to send error event to %s.\n",
+ service->name);
+ }
+}
+
static int adf_dev_shutdown_cache_cfg(struct adf_accel_dev *accel_dev)
{
char services[ADF_CFG_MAX_VAL_LEN_IN_BYTES] = {0};
--
2.34.1


2024-01-25 10:17:17

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 1/9] crypto: qat - add heartbeat error simulator

On Wed, Jan 03, 2024 at 12:07:14PM +0800, Mun Chun Yep wrote:
>
> +config CRYPTO_DEV_QAT_ERROR_INJECTION
> + bool "Support for Intel(R) QAT Devices Heartbeat Error Injection"
> + default n

This is redundant as n is the default.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2024-02-01 06:16:55

by Tero Kristo

[permalink] [raw]
Subject: Re: [PATCH 1/9] crypto: qat - add heartbeat error simulator

<snip>

On 03/01/2024 06:07, Yep, Mun Chun wrote:
> diff --git a/drivers/crypto/intel/qat/qat_common/Makefile b/drivers/crypto/intel/qat/qat_common/Makefile
> index 6908727bff3b..e36310667335 100644
> --- a/drivers/crypto/intel/qat/qat_common/Makefile
> +++ b/drivers/crypto/intel/qat/qat_common/Makefile
> @@ -53,3 +53,6 @@ intel_qat-$(CONFIG_PCI_IOV) += adf_sriov.o adf_vf_isr.o adf_pfvf_utils.o \
> adf_pfvf_pf_msg.o adf_pfvf_pf_proto.o \
> adf_pfvf_vf_msg.o adf_pfvf_vf_proto.o \
> adf_gen2_pfvf.o adf_gen4_pfvf.o
> +
> +intel_qat-$(CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION) += adf_heartbeat_inject.o
> +ccflags-$(CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION) += -DQAT_HB_ERROR_INJECTION

The above ccflags hack seems unnecessary, why not use the CONFIG option
directly in the code?

-Tero

---------------------------------------------------------------------
Intel Finland Oy
Registered Address: PL 281, 00181 Helsinki
Business Identity Code: 0357606 - 4
Domiciled in Helsinki

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.