2022-09-02 03:21:00

by yekai (A)

Subject: [PATCH v8 0/3] crypto: hisilicon - supports device isolation feature

1. Add the uacce hardware error isolation interface, which supports
configuring the hardware error isolation frequency.
2. Define the isolation strategy for ACC through the uacce sysfs node. If
the number of hardware errors within one hour exceeds the configured value,
the device becomes unavailable to user space. A VF device uses the
isolation strategy of its PF device.
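
For illustration only: the per-hour threshold described in item 2 can be pictured as a sliding-window counter. The sketch below is a userspace model, not code from this series. The names `err_window`, `err_window_init`, `err_window_record`, and the `MAX_TRACKED` cap are hypothetical, and treating a threshold of 0 as "isolate on the first error" is an assumption that the thread does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SECONDS 3600u /* one hour, as in the cover letter */
#define MAX_TRACKED    64u   /* hypothetical cap; thresholds >= this never trip */

struct err_window {
	uint64_t stamps[MAX_TRACKED]; /* timestamps (seconds) of recent errors */
	size_t count;                 /* number of valid entries in stamps[] */
	uint32_t threshold;           /* value a user would write to isolate_strategy */
	bool isolated;                /* sticky once set */
};

static void err_window_init(struct err_window *w, uint32_t threshold)
{
	assert(w != NULL);
	w->count = 0;
	w->threshold = threshold;
	w->isolated = false;
}

/*
 * Record one error at time `now` (seconds, assumed non-decreasing) and
 * return the resulting isolation state.
 */
static bool err_window_record(struct err_window *w, uint64_t now)
{
	size_t i, kept = 0;

	assert(w != NULL);

	/* Forget errors that are older than the window. */
	for (i = 0; i < w->count; i++) {
		if (now - w->stamps[i] < WINDOW_SECONDS)
			w->stamps[kept++] = w->stamps[i];
	}
	w->count = kept;

	/* If the log is full, discard the oldest entry to make room. */
	if (w->count == MAX_TRACKED) {
		memmove(w->stamps, w->stamps + 1,
			(MAX_TRACKED - 1) * sizeof(w->stamps[0]));
		w->count--;
	}
	w->stamps[w->count++] = now;

	/* "Exceeds the configured value" is modelled as a strict comparison. */
	if (w->count > w->threshold)
		w->isolated = true;

	return w->isolated;
}
```

With a threshold of 2, three errors inside one hour set the isolated state, while errors spaced more than an hour apart do not accumulate.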

changes v1->v2:
- deleted dev_to_uacce api.
- add vfs node doc.
- move uacce->ref to driver.
changes v2->v3:
- deleted some redundant code.
- use qm state instead of reference count.
- add null pointer check.
- isolate_strategy_read() instead of a copy.
changes v3->v4:
- modify a comment
changes v4->v5:
- use bool instead of atomic.
- isolation frequency instead of isolation command.
changes v5->v6:
- add is_visible in uacce.
- add the description of the isolation strategy file node.
changes v6->v7:
- add an example for isolate_strategy in Documentation.
changes v7->v8:
- update the correct date.

Kai Ye (3):
uacce: supports device isolation feature
Documentation: add a isolation strategy sysfs node for uacce
crypto: hisilicon/qm - define the device isolation strategy

Documentation/ABI/testing/sysfs-driver-uacce | 26 +++
drivers/crypto/hisilicon/qm.c | 163 +++++++++++++++++--
drivers/misc/uacce/uacce.c | 58 +++++++
include/linux/hisi_acc_qm.h | 9 +
include/linux/uacce.h | 11 ++
5 files changed, 255 insertions(+), 12 deletions(-)

--
2.17.1


2022-09-02 03:21:26

by yekai (A)

Subject: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce

Update documentation describing sysfs node that could help to
configure isolation strategy for users in the user space. And
describing sysfs node that could read the device isolated state.

Signed-off-by: Kai Ye <[email protected]>
---
Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..af5bc2f326d2 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,32 @@ Contact: [email protected]
Description: Available instances left of the device
Return -ENODEV if uacce_ops get_available_instances is not provided

+What: /sys/class/uacce/<dev_name>/isolate_strategy
+Date: Sep 2022
+KernelVersion: 6.0
+Contact: [email protected]
+Description: (RW) Configure the frequency size for the hardware error
+ isolation strategy. This size is a configured integer value.
+ The default is 0. The maximum value is 65535. This value is a
+ threshold based on your driver strategies.
+
+ For example, in the hisilicon accelerator engine, first we will
+ time-stamp every slot AER error. Then check the AER error log
+ when the device AER error occurred. if the device slot AER error
+ count exceeds the preset the number of times in one hour, the
+ isolated state will be set to true. So the device will be
+ isolated. And the AER error log that exceed one hour will be
+ cleared. Of course, different strategies can be defined in
+ different drivers.
+
+What: /sys/class/uacce/<dev_name>/isolate
+Date: Sep 2022
+KernelVersion: 6.0
+Contact: [email protected]
+Description: (R) A sysfs node that read the device isolated state. The value 1
+ means the device is unavailable. The 0 means the device is
+ available.
+
What: /sys/class/uacce/<dev_name>/algorithms
Date: Feb 2020
KernelVersion: 5.7
--
2.17.1

2022-09-02 03:21:56

by yekai (A)

Subject: [PATCH v8 1/3] uacce: supports device isolation feature

UACCE adds the hardware error isolation API. Users can configure
the isolation frequency by this sysfs node. UACCE reports the device
isolate state to the user space. If the AER error frequency exceeds
the value of setting for a certain period of time, the device will be
isolated.

Signed-off-by: Kai Ye <[email protected]>
---
drivers/misc/uacce/uacce.c | 58 ++++++++++++++++++++++++++++++++++++++
include/linux/uacce.h | 11 ++++++++
2 files changed, 69 insertions(+)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 281c54003edc..41f454c89cd1 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -7,6 +7,8 @@
#include <linux/slab.h>
#include <linux/uacce.h>

+#define MAX_ERR_ISOLATE_COUNT 65535
+
static struct class *uacce_class;
static dev_t uacce_devt;
static DEFINE_MUTEX(uacce_mutex);
@@ -339,12 +341,57 @@ static ssize_t region_dus_size_show(struct device *dev,
uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
}

+static ssize_t isolate_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct uacce_device *uacce = to_uacce_device(dev);
+
+ if (!uacce->ops->get_isolate_state)
+ return -ENODEV;
+
+ return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
+}
+
+static ssize_t isolate_strategy_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct uacce_device *uacce = to_uacce_device(dev);
+ u32 val;
+
+ val = uacce->ops->isolate_strategy_read(uacce);
+ if (val > MAX_ERR_ISOLATE_COUNT)
+ return -EINVAL;
+
+ return sysfs_emit(buf, "%u\n", val);
+}
+
+static ssize_t isolate_strategy_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct uacce_device *uacce = to_uacce_device(dev);
+ unsigned long val;
+ int ret;
+
+ if (kstrtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val > MAX_ERR_ISOLATE_COUNT)
+ return -EINVAL;
+
+ ret = uacce->ops->isolate_strategy_write(uacce, val);
+
+ return ret ? ret : count;
+}
+
static DEVICE_ATTR_RO(api);
static DEVICE_ATTR_RO(flags);
static DEVICE_ATTR_RO(available_instances);
static DEVICE_ATTR_RO(algorithms);
static DEVICE_ATTR_RO(region_mmio_size);
static DEVICE_ATTR_RO(region_dus_size);
+static DEVICE_ATTR_RO(isolate);
+static DEVICE_ATTR_RW(isolate_strategy);

static struct attribute *uacce_dev_attrs[] = {
&dev_attr_api.attr,
@@ -353,6 +400,8 @@ static struct attribute *uacce_dev_attrs[] = {
&dev_attr_algorithms.attr,
&dev_attr_region_mmio_size.attr,
&dev_attr_region_dus_size.attr,
+ &dev_attr_isolate.attr,
+ &dev_attr_isolate_strategy.attr,
NULL,
};

@@ -368,6 +417,15 @@ static umode_t uacce_dev_is_visible(struct kobject *kobj,
(!uacce->qf_pg_num[UACCE_QFRT_DUS])))
return 0;

+ if (attr == &dev_attr_isolate_strategy.attr &&
+ (!uacce->ops->isolate_strategy_read ||
+ !uacce->ops->isolate_strategy_write))
+ return 0;
+
+ if (attr == &dev_attr_isolate.attr &&
+ !uacce->ops->get_isolate_state)
+ return 0;
+
return attr->mode;
}

diff --git a/include/linux/uacce.h b/include/linux/uacce.h
index 48e319f40275..69e8f238d80c 100644
--- a/include/linux/uacce.h
+++ b/include/linux/uacce.h
@@ -30,6 +30,9 @@ struct uacce_qfile_region {
* @is_q_updated: check whether the task is finished
* @mmap: mmap addresses of queue to user space
* @ioctl: ioctl for user space users of the queue
+ * @get_isolate_state: get the device state after set the isolate strategy
+ * @isolate_strategy_write: stored the isolate strategy to the device
+ * @isolate_strategy_read: read the isolate strategy value from the device
*/
struct uacce_ops {
int (*get_available_instances)(struct uacce_device *uacce);
@@ -43,6 +46,9 @@ struct uacce_ops {
struct uacce_qfile_region *qfr);
long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
unsigned long arg);
+ enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
+ int (*isolate_strategy_write)(struct uacce_device *uacce, u32 freq);
+ u32 (*isolate_strategy_read)(struct uacce_device *uacce);
};

/**
@@ -57,6 +63,11 @@ struct uacce_interface {
const struct uacce_ops *ops;
};

+enum uacce_dev_state {
+ UACCE_DEV_NORMAL,
+ UACCE_DEV_ISOLATE,
+};
+
enum uacce_q_state {
UACCE_Q_ZOMBIE = 0,
UACCE_Q_INIT,
--
2.17.1

2022-09-09 08:39:04

by Greg Kroah-Hartman

Subject: Re: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce

On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
> Update documentation describing sysfs node that could help to
> configure isolation strategy for users in the user space. And
> describing sysfs node that could read the device isolated state.
>
> Signed-off-by: Kai Ye <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> index 08f2591138af..af5bc2f326d2 100644
> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> @@ -19,6 +19,32 @@ Contact: [email protected]
> Description: Available instances left of the device
> Return -ENODEV if uacce_ops get_available_instances is not provided
>
> +What: /sys/class/uacce/<dev_name>/isolate_strategy
> +Date: Sep 2022
> +KernelVersion: 6.0
> +Contact: [email protected]
> +Description: (RW) Configure the frequency size for the hardware error
> + isolation strategy. This size is a configured integer value.
> + The default is 0. The maximum value is 65535. This value is a
> + threshold based on your driver strategies.

I do not understand what the units are here.

How is anyone supposed to know what they are?

> + For example, in the hisilicon accelerator engine, first we will
> + time-stamp every slot AER error. Then check the AER error log
> + when the device AER error occurred. if the device slot AER error
> + count exceeds the preset the number of times in one hour, the
> + isolated state will be set to true. So the device will be
> + isolated. And the AER error log that exceed one hour will be
> + cleared. Of course, different strategies can be defined in
> + different drivers.

So this file can contain values of different units depending on the
different driver that creates it? How is anyone supposed to know what
it is and what it should be?

This feels very loose, please define this much better so that it can be
understood and maintained properly.

thanks,

greg k-h

2022-09-09 08:39:04

by Greg Kroah-Hartman

Subject: Re: [PATCH v8 1/3] uacce: supports device isolation feature

On Fri, Sep 02, 2022 at 03:13:02AM +0000, Kai Ye wrote:
> UACCE adds the hardware error isolation API. Users can configure
> the isolation frequency by this sysfs node. UACCE reports the device
> isolate state to the user space. If the AER error frequency exceeds
> the value of setting for a certain period of time, the device will be
> isolated.
>
> Signed-off-by: Kai Ye <[email protected]>
> ---
> drivers/misc/uacce/uacce.c | 58 ++++++++++++++++++++++++++++++++++++++
> include/linux/uacce.h | 11 ++++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index 281c54003edc..41f454c89cd1 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -7,6 +7,8 @@
> #include <linux/slab.h>
> #include <linux/uacce.h>
>
> +#define MAX_ERR_ISOLATE_COUNT 65535

What units is this in? Shouldn't this be in a .h file somewhere as it
is a limit you impose on a driver implementing this API.

> +
> static struct class *uacce_class;
> static dev_t uacce_devt;
> static DEFINE_MUTEX(uacce_mutex);
> @@ -339,12 +341,57 @@ static ssize_t region_dus_size_show(struct device *dev,
> uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
> }
>
> +static ssize_t isolate_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct uacce_device *uacce = to_uacce_device(dev);
> +
> + if (!uacce->ops->get_isolate_state)
> + return -ENODEV;
> +
> + return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
> +}
> +
> +static ssize_t isolate_strategy_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct uacce_device *uacce = to_uacce_device(dev);
> + u32 val;
> +
> + val = uacce->ops->isolate_strategy_read(uacce);
> + if (val > MAX_ERR_ISOLATE_COUNT)
> + return -EINVAL;

How can a driver return a higher number here?

> +
> + return sysfs_emit(buf, "%u\n", val);
> +}
> +
> +static ssize_t isolate_strategy_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct uacce_device *uacce = to_uacce_device(dev);
> + unsigned long val;
> + int ret;
> +
> + if (kstrtoul(buf, 0, &val) < 0)
> + return -EINVAL;
> +
> + if (val > MAX_ERR_ISOLATE_COUNT)
> + return -EINVAL;
> +
> + ret = uacce->ops->isolate_strategy_write(uacce, val);
> +
> + return ret ? ret : count;

Please write out if statements.

> +}
> +
> static DEVICE_ATTR_RO(api);
> static DEVICE_ATTR_RO(flags);
> static DEVICE_ATTR_RO(available_instances);
> static DEVICE_ATTR_RO(algorithms);
> static DEVICE_ATTR_RO(region_mmio_size);
> static DEVICE_ATTR_RO(region_dus_size);
> +static DEVICE_ATTR_RO(isolate);
> +static DEVICE_ATTR_RW(isolate_strategy);
>
> static struct attribute *uacce_dev_attrs[] = {
> &dev_attr_api.attr,
> @@ -353,6 +400,8 @@ static struct attribute *uacce_dev_attrs[] = {
> &dev_attr_algorithms.attr,
> &dev_attr_region_mmio_size.attr,
> &dev_attr_region_dus_size.attr,
> + &dev_attr_isolate.attr,
> + &dev_attr_isolate_strategy.attr,
> NULL,
> };
>
> @@ -368,6 +417,15 @@ static umode_t uacce_dev_is_visible(struct kobject *kobj,
> (!uacce->qf_pg_num[UACCE_QFRT_DUS])))
> return 0;
>
> + if (attr == &dev_attr_isolate_strategy.attr &&
> + (!uacce->ops->isolate_strategy_read ||
> + !uacce->ops->isolate_strategy_write))

So you need either a read or write? Why not both?

thanks,

greg k-h
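
For illustration only: the review comments on `isolate_strategy_store()` (range-check the value, write out the `if` statement instead of using `?:`) can be modelled in userspace as below. This is a sketch, not the revised patch. `parse_threshold`, `model_store`, and `stored_threshold` are hypothetical names, and `strtoul()` only approximates the kernel's `kstrtoul()` (for example, it tolerates leading whitespace).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h> /* ssize_t */

#define MAX_ERR_ISOLATE_COUNT 65535UL

/* Hypothetical stand-in for the value a driver would keep. */
static unsigned long stored_threshold;

/* Parse a decimal, hex, or octal string and enforce the documented maximum. */
static int parse_threshold(const char *buf, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(buf != NULL && out != NULL);

	errno = 0;
	val = strtoul(buf, &end, 0);
	if (errno != 0 || end == buf)
		return -EINVAL;

	/* Accept one trailing newline, as "echo 5 > file" would produce. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val > MAX_ERR_ISOLATE_COUNT)
		return -EINVAL;

	*out = val;
	return 0;
}

/* Model of a sysfs store callback: returns count on success, -errno on error. */
static ssize_t model_store(const char *buf, size_t count)
{
	unsigned long val;
	int ret;

	ret = parse_threshold(buf, &val);
	if (ret)
		return ret;

	stored_threshold = val;
	return (ssize_t)count;
}
```

The written-out `if (ret) return ret;` form keeps the error path and the success path on separate lines, which is the style the review asks for.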

2022-09-19 03:42:31

by yekai (A)

Subject: Re: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce



On 2022/9/9 16:27, Greg KH wrote:
> On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
>> Update documentation describing sysfs node that could help to
>> configure isolation strategy for users in the user space. And
>> describing sysfs node that could read the device isolated state.
>>
>> Signed-off-by: Kai Ye <[email protected]>
>> ---
>> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
>> index 08f2591138af..af5bc2f326d2 100644
>> --- a/Documentation/ABI/testing/sysfs-driver-uacce
>> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
>> @@ -19,6 +19,32 @@ Contact: [email protected]
>> Description: Available instances left of the device
>> Return -ENODEV if uacce_ops get_available_instances is not provided
>>
>> +What: /sys/class/uacce/<dev_name>/isolate_strategy
>> +Date: Sep 2022
>> +KernelVersion: 6.0
>> +Contact: [email protected]
>> +Description: (RW) Configure the frequency size for the hardware error
>> + isolation strategy. This size is a configured integer value.
>> + The default is 0. The maximum value is 65535. This value is a
>> + threshold based on your driver strategies.
> I do not understand what the units are here.
>
> How is anyone supposed to know what they are?

The unit is a count: the number of occurrences within a period, which acts as a threshold.
If the number of PCI AER errors on the device exceeds the threshold within a time window, the
device is isolated.

>> + For example, in the hisilicon accelerator engine, first we will
>> + time-stamp every slot AER error. Then check the AER error log
>> + when the device AER error occurred. if the device slot AER error
>> + count exceeds the preset the number of times in one hour, the
>> + isolated state will be set to true. So the device will be
>> + isolated. And the AER error log that exceed one hour will be
>> + cleared. Of course, different strategies can be defined in
>> + different drivers.
> So this file can contain values of different units depending on the
> different driver that creates it? How is anyone supposed to know what
> it is and what it should be?
>
> This feels very loose, please define this much better so that it can be
> understood and maintained properly.
>
> thanks,
>
> greg k-h
> .
>
Yes. We started out with the idea of not restricting the individual drivers and only specifying the input and output,
because we think different drivers require different processing strategies.

2022-09-19 09:39:20

by Greg Kroah-Hartman

Subject: Re: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce

On Mon, Sep 19, 2022 at 11:21:30AM +0800, yekai (A) wrote:
>
>
> On 2022/9/9 16:27, Greg KH wrote:
> > On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
> >> Update documentation describing sysfs node that could help to
> >> configure isolation strategy for users in the user space. And
> >> describing sysfs node that could read the device isolated state.
> >>
> >> Signed-off-by: Kai Ye <[email protected]>
> >> ---
> >> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
> >> 1 file changed, 26 insertions(+)
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> >> index 08f2591138af..af5bc2f326d2 100644
> >> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> >> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> >> @@ -19,6 +19,32 @@ Contact: [email protected]
> >> Description: Available instances left of the device
> >> Return -ENODEV if uacce_ops get_available_instances is not provided
> >>
> >> +What: /sys/class/uacce/<dev_name>/isolate_strategy
> >> +Date: Sep 2022
> >> +KernelVersion: 6.0
> >> +Contact: [email protected]
> >> +Description: (RW) Configure the frequency size for the hardware error
> >> + isolation strategy. This size is a configured integer value.
> >> + The default is 0. The maximum value is 65535. This value is a
> >> + threshold based on your driver strategies.
> > I do not understand what the units are here.
> >
> > How is anyone supposed to know what they are?
>
> This unit is the number of times. Number of occurrences in a period, also means threshold.
> If the number of device pci AER error exceeds the threshold in a time window, the device is
> isolated.

Please document this very very well.

> >> + For example, in the hisilicon accelerator engine, first we will
> >> + time-stamp every slot AER error. Then check the AER error log
> >> + when the device AER error occurred. if the device slot AER error
> >> + count exceeds the preset the number of times in one hour, the
> >> + isolated state will be set to true. So the device will be
> >> + isolated. And the AER error log that exceed one hour will be
> >> + cleared. Of course, different strategies can be defined in
> >> + different drivers.
> > So this file can contain values of different units depending on the
> > different driver that creates it? How is anyone supposed to know what
> > it is and what it should be?
> >
> > This feels very loose, please define this much better so that it can be
> > understood and maintained properly.
> >
> > thanks,
> >
> > greg k-h
> > .
> >
> Yes, We started out with the idea of not restricting the different drive, only specifying the input and output.
> Because we think different drivers require different processing strategy.

What different drivers? You only have one! And why do you need a
framework for only one driver? You should only add that when you have
multiple users to ensure you got the framework correct otherwise you do
not know how it will be used.

thanks,

greg k-h

2022-09-21 03:07:47

by yekai (A)

Subject: Re: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce



On 2022/9/19 17:34, Greg KH wrote:
> On Mon, Sep 19, 2022 at 11:21:30AM +0800, yekai (A) wrote:
>>
>> On 2022/9/9 16:27, Greg KH wrote:
>>> On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
>>>> Update documentation describing sysfs node that could help to
>>>> configure isolation strategy for users in the user space. And
>>>> describing sysfs node that could read the device isolated state.
>>>>
>>>> Signed-off-by: Kai Ye <[email protected]>
>>>> ---
>>>> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
>>>> 1 file changed, 26 insertions(+)
>>>>
>>>> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
>>>> index 08f2591138af..af5bc2f326d2 100644
>>>> --- a/Documentation/ABI/testing/sysfs-driver-uacce
>>>> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
>>>> @@ -19,6 +19,32 @@ Contact: [email protected]
>>>> Description: Available instances left of the device
>>>> Return -ENODEV if uacce_ops get_available_instances is not provided
>>>>
>>>> +What: /sys/class/uacce/<dev_name>/isolate_strategy
>>>> +Date: Sep 2022
>>>> +KernelVersion: 6.0
>>>> +Contact: [email protected]
>>>> +Description: (RW) Configure the frequency size for the hardware error
>>>> + isolation strategy. This size is a configured integer value.
>>>> + The default is 0. The maximum value is 65535. This value is a
>>>> + threshold based on your driver strategies.
>>> I do not understand what the units are here.
>>>
>>> How is anyone supposed to know what they are?
>> This unit is the number of times. Number of occurrences in a period, also means threshold.
>> If the number of device pci AER error exceeds the threshold in a time window, the device is
>> isolated.
> Please document this very very well.
>
>>>> + For example, in the hisilicon accelerator engine, first we will
>>>> + time-stamp every slot AER error. Then check the AER error log
>>>> + when the device AER error occurred. if the device slot AER error
>>>> + count exceeds the preset the number of times in one hour, the
>>>> + isolated state will be set to true. So the device will be
>>>> + isolated. And the AER error log that exceed one hour will be
>>>> + cleared. Of course, different strategies can be defined in
>>>> + different drivers.
>>> So this file can contain values of different units depending on the
>>> different driver that creates it? How is anyone supposed to know what
>>> it is and what it should be?
>>>
>>> This feels very loose, please define this much better so that it can be
>>> understood and maintained properly.
>>>
>>> thanks,
>>>
>>> greg k-h
>>> .
>>>
>> Yes, We started out with the idea of not restricting the different drive, only specifying the input and output.
>> Because we think different drivers require different processing strategy.
> What different drivers? You only have one! And why do you need a
> framework for only one driver? You should only add that when you have
> multiple users to ensure you got the framework correct otherwise you do
> not know how it will be used.
>
> thanks,
>
> greg k-h
> .
>
OK. I will move the isolation strategy from qm to uacce in the next version,
so that it can be understood and maintained properly.

thanks

Kai