LinuxLists.cc - [PATCH v3 0/2] drm: Add GPU reset sysfs

2022-11-25 18:17:53

Subject: [PATCH v3 0/2] drm: Add GPU reset sysfs

This patchset adds a udev event for DRM device's resets.

Userspace apps can trigger GPU resets by misuse of graphical APIs or driver
bugs. Either way, the GPU reset might lead the system to a broken state[1], that
might be recovered if user has access to a tty or a remote shell. Arguably, this
recovery could happen automatically by the system itself, thus this is the goal
of this patchset.

For debugging and report purposes, device coredump support was already added
for amdgpu[2], but it's not suitable for programmatic usage like this one given
the uAPI not being stable and the need for parsing.

GL/VK is out of scope for this use, giving that we are dealing with device
resets regardless of API.

A basic userspace daemon is provided at [3] showing how the interface is used
to recovery from resets.

[1] A search for "reset" in DRM/AMD issue tracker shows reports of resets
making the system unusable:
https://gitlab.freedesktop.org/drm/amd/-/issues/?search=reset

[2] https://lore.kernel.org/amd-gfx/[email protected]/

[3] https://gitlab.freedesktop.org/andrealmeid/gpu-resetd

v2: https://lore.kernel.org/dri-devel/[email protected]/

André Almeida (1):
drm/amdgpu: Add work function for GPU reset event

Shashank Sharma (1):
drm: Add GPU reset sysfs event

drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
drivers/gpu/drm/drm_sysfs.c | 26 +++++++++++++++++++
include/drm/drm_sysfs.h | 13 ++++++++++
4 files changed, 73 insertions(+)

--
2.38.1

2022-11-25 18:20:28

by André Almeida

[permalink] [raw]

Subject: [PATCH v3 2/2] drm/amdgpu: Add work function for GPU reset event

Add a work function to send a GPU reset uevent and scheduled it during
a GPU reset.

Co-developed-by: Shashank Sharma <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
Signed-off-by: André Almeida <[email protected]>
---
V3:
- Merge two last commits

V2: Addressed review comments from Christian
- Changed the name of the work to gpu_reset_event_work
- Added a structure to accommodate some additional information
(like a PID and some flags)
- Do not add new structure in amdgpu.h
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
2 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 6b74df446694..88cb5b739c5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -60,6 +60,8 @@
#include <drm/amdgpu_drm.h>
#include <drm/drm_gem.h>
#include <drm/drm_ioctl.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/drm_sysfs.h>

#include <kgd_kfd_interface.h>
#include "dm_pp_interface.h"
@@ -1003,6 +1005,7 @@ struct amdgpu_device {

int asic_reset_res;
struct work_struct xgmi_reset_work;
+ struct work_struct gpu_reset_event_work;
struct list_head reset_list;

long gfx_timeout;
@@ -1036,6 +1039,7 @@ struct amdgpu_device {
pci_channel_state_t pci_channel_state;

struct amdgpu_reset_control *reset_cntl;
+ struct drm_reset_event_info reset_event_info;
uint32_t ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];

bool ram_is_direct_mapped;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b2b1c66bfe39..d04541fdb606 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -77,6 +77,7 @@
#include <linux/pm_runtime.h>

#include <drm/drm_drv.h>
+#include <drm/drm_sysfs.h>

MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
@@ -3365,6 +3366,19 @@ bool amdgpu_device_has_dc_support(struct amdgpu_device *adev)
return amdgpu_device_asic_has_dc_support(adev->asic_type);
}

+static void amdgpu_device_reset_event_func(struct work_struct *__work)
+{
+ struct amdgpu_device *adev = container_of(__work, struct amdgpu_device,
+ gpu_reset_event_work);
+ /*
+ * A GPU reset has happened, inform the userspace and pass the reset
+ * related information
+ */
+ drm_sysfs_reset_event(&adev->ddev, &adev->reset_event_info);
+
+ put_pid(adev->reset_event_info.pid);
+}
+
static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
{
struct amdgpu_device *adev =
@@ -3616,6 +3630,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
amdgpu_device_delay_enable_gfx_off);

INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
+ INIT_WORK(&adev->gpu_reset_event_work, amdgpu_device_reset_event_func);

adev->gfx.gfx_off_req_count = 1;
adev->gfx.gfx_off_residency = 0;
@@ -4920,6 +4935,21 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
goto out;

vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
+
+ if (reset_context->job && reset_context->job->vm) {
+ tmp_adev->reset_event_info.pid =
+ find_get_pid(reset_context->job->vm->task_info.pid);
+ } else {
+ tmp_adev->reset_event_info.pid = NULL;
+ }
+
+ if (vram_lost)
+ tmp_adev->reset_event_info.flags |=
+ DRM_RESET_EVENT_VRAM_LOST;
+
+ /* Send GPU reset event */
+ schedule_work(&tmp_adev->gpu_reset_event_work);
+
#ifdef CONFIG_DEV_COREDUMP
tmp_adev->reset_vram_lost = vram_lost;
memset(&tmp_adev->reset_task_info, 0,
--
2.38.1

2022-11-25 18:21:12

by André Almeida

[permalink] [raw]

Subject: [PATCH v3 1/2] drm: Add GPU reset sysfs event

From: Shashank Sharma <[email protected]>

Add a sysfs event to notify userspace about GPU resets providing:
- PID that triggered the GPU reset, if any. Resets can happen from
kernel threads as well, in that case no PID is provided
- Information about the reset (e.g. was VRAM lost?)

Co-developed-by: André Almeida <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
---

V3:
- Reduce information to just PID and flags
- Use pid pointer instead of just pid number
- BUG() if no reset info is provided

V2:
- Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit(than 64)
- set flag when VRAM invalid (than valid)
- add process name as well (Amar)
---
drivers/gpu/drm/drm_sysfs.c | 26 ++++++++++++++++++++++++++
include/drm/drm_sysfs.h | 13 +++++++++++++
2 files changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..85777abf4194 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,32 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
}
EXPORT_SYMBOL(drm_sysfs_hotplug_event);

+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info)
+{
+ unsigned char pid_str[13];
+ unsigned char flags_str[18];
+ unsigned char reset_str[] = "RESET=1";
+ char *envp[] = { reset_str, pid_str, flags_str, NULL };
+
+ DRM_DEBUG("generating reset event\n");
+
+ BUG_ON(!reset_info);
+
+ snprintf(pid_str, sizeof(pid_str), "PID=%u", pid_vnr(reset_info->pid));
+ snprintf(flags_str, sizeof(flags_str), "FLAGS=0x%llx", reset_info->flags);
+ kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
/**
* drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
* change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..dbb0ac6230b8 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -2,15 +2,28 @@
#ifndef _DRM_SYSFS_H_
#define _DRM_SYSFS_H_

+#define DRM_RESET_EVENT_VRAM_LOST (1 << 0)
+
struct drm_device;
struct device;
struct drm_connector;
struct drm_property;

+/**
+ * struct drm_reset_event_info - Information about a GPU reset event
+ * @pid: Process that triggered the reset, if any
+ * @flags: Extra information around the reset event (e.g. is VRAM lost?)
+ */
+struct drm_reset_event_info {
+ struct pid *pid;
+ uint64_t flags;
+};
+
int drm_class_device_register(struct device *dev);
void drm_class_device_unregister(struct device *dev);

void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info);
void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
void drm_sysfs_connector_status_event(struct drm_connector *connector,
struct drm_property *property);
--
2.38.1

2022-11-28 09:47:35

by Pekka Paalanen

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On Fri, 25 Nov 2022 14:52:01 -0300
André Almeida <[email protected]> wrote:

> This patchset adds a udev event for DRM device's resets.

Hi,

this seems a good idea to me.

> Userspace apps can trigger GPU resets by misuse of graphical APIs or driver
> bugs. Either way, the GPU reset might lead the system to a broken state[1], that
> might be recovered if user has access to a tty or a remote shell. Arguably, this
> recovery could happen automatically by the system itself, thus this is the goal
> of this patchset.
>
> For debugging and report purposes, device coredump support was already added
> for amdgpu[2], but it's not suitable for programmatic usage like this one given
> the uAPI not being stable and the need for parsing.
>
> GL/VK is out of scope for this use, giving that we are dealing with device
> resets regardless of API.

I see that the reported PID is intended to be the culprit, the process
that caused the GPU to crash or hang, if identified. Hence, killing
that process perhaps makes sense, even if it could recover on its own
through GL/VK "device lost" mechanism.

"VRAM lost" is interesting. Innocent processes essentially lost the GPU
in that case, I suppose, but that's no reason to kill them and restart
the whole graphics stack outright. Those that actually handle GL/VK
device lost should theoretically be fine, right?

Display servers can make more enlightened decisions on whether they
need to restart or not, if they are implemented to handle that.

The example gpu-resetd [3] behaviour in that case seems sub-optimal.
Could it do better? How would it know, or avoid knowing, which
processes handled the GPU reset fine and which need external restarting?

Maybe gpu-resetd should kill the culprit only if it causes resets
repeatedly? But if the culprit does not handle device lost and also
does not die... how do you know you need to kill it?

Thanks,
pq

>
> A basic userspace daemon is provided at [3] showing how the interface is used
> to recovery from resets.
>
> [1] A search for "reset" in DRM/AMD issue tracker shows reports of resets
> making the system unusable:
> https://gitlab.freedesktop.org/drm/amd/-/issues/?search=reset
>
> [2] https://lore.kernel.org/amd-gfx/[email protected]/
>
> [3] https://gitlab.freedesktop.org/andrealmeid/gpu-resetd
>
> v2: https://lore.kernel.org/dri-devel/[email protected]/
>
> André Almeida (1):
> drm/amdgpu: Add work function for GPU reset event
>
> Shashank Sharma (1):
> drm: Add GPU reset sysfs event
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
> drivers/gpu/drm/drm_sysfs.c | 26 +++++++++++++++++++
> include/drm/drm_sysfs.h | 13 ++++++++++
> 4 files changed, 73 insertions(+)
>

Attachments:

(No filename) (849.00 B)
OpenPGP digital signature

2022-11-28 09:52:00

by Pekka Paalanen

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] drm: Add GPU reset sysfs event

On Fri, 25 Nov 2022 14:52:02 -0300
André Almeida <[email protected]> wrote:

> From: Shashank Sharma <[email protected]>
>
> Add a sysfs event to notify userspace about GPU resets providing:
> - PID that triggered the GPU reset, if any. Resets can happen from
> kernel threads as well, in that case no PID is provided
> - Information about the reset (e.g. was VRAM lost?)
>
> Co-developed-by: André Almeida <[email protected]>
> Signed-off-by: André Almeida <[email protected]>
> Signed-off-by: Shashank Sharma <[email protected]>
> ---
>
> V3:
> - Reduce information to just PID and flags
> - Use pid pointer instead of just pid number
> - BUG() if no reset info is provided
>
> V2:
> - Addressed review comments from Christian and Amar
> - move the reset information structure to DRM layer
> - drop _ctx from struct name
> - make pid 32 bit(than 64)
> - set flag when VRAM invalid (than valid)
> - add process name as well (Amar)
> ---
> drivers/gpu/drm/drm_sysfs.c | 26 ++++++++++++++++++++++++++
> include/drm/drm_sysfs.h | 13 +++++++++++++
> 2 files changed, 39 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..85777abf4194 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,32 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
> }
> EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info)
> +{
> + unsigned char pid_str[13];

Hi,

"PID=4111222333\0"

I count 15 bytes instead of 13?

> + unsigned char flags_str[18];
> + unsigned char reset_str[] = "RESET=1";
> + char *envp[] = { reset_str, pid_str, flags_str, NULL };

It seems you always send PID attribute, even if it's nonsense (I guess
zero). Should sending nonsense be avoided?

> +
> + DRM_DEBUG("generating reset event\n");
> +
> + BUG_ON(!reset_info);
> +
> + snprintf(pid_str, sizeof(pid_str), "PID=%u", pid_vnr(reset_info->pid));

Passing PID by number is racy, but I suppose it has two rationales:
- there is no way to pass a pidfd?
- it's safe enough because the kernel avoids aggressive PID re-use?

Maybe this would be good to note in the commit message to justify the
design.

What about pid namespaces, are they handled by pid_vnr() auto-magically
somehow? Does it mean that the daemon handling these events *must not*
be running under a (non-root?) pid namespace to work at all?

E.g. if you have a container that is given a dedicated GPU device, I
guess it might be reasonable to want to run the daemon inside that
container as well. I have no idea how pid namespaces work, but I recall
hearing they are a thing.

> + snprintf(flags_str, sizeof(flags_str), "FLAGS=0x%llx", reset_info->flags);
> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
> /**
> * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
> * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..dbb0ac6230b8 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -2,15 +2,28 @@
> #ifndef _DRM_SYSFS_H_
> #define _DRM_SYSFS_H_
>
> +#define DRM_RESET_EVENT_VRAM_LOST (1 << 0)

Since flags are UAPI, they should be documented somewhere in UAPI docs.

Shouldn't this whole event be documented somewhere in UAPI docs to say
what it means and what attributes it may have and what they mean?

Thanks,
pq

> +
> struct drm_device;
> struct device;
> struct drm_connector;
> struct drm_property;
>
> +/**
> + * struct drm_reset_event_info - Information about a GPU reset event
> + * @pid: Process that triggered the reset, if any
> + * @flags: Extra information around the reset event (e.g. is VRAM lost?)
> + */
> +struct drm_reset_event_info {
> + struct pid *pid;
> + uint64_t flags;
> +};
> +
> int drm_class_device_register(struct device *dev);
> void drm_class_device_unregister(struct device *dev);
>
> void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info);
> void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
> void drm_sysfs_connector_status_event(struct drm_connector *connector,
> struct drm_property *property);

Attachments:

(No filename) (849.00 B)
OpenPGP digital signature

2022-11-28 10:03:41

by Simon Ser

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

The PID is racy, the user-space daemon could end up killing an
unrelated process… Is there any way we could use a pidfd instead?

2022-11-29 14:41:11

by Alex Deucher

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] drm: Add GPU reset sysfs event

On Fri, Nov 25, 2022 at 12:52 PM André Almeida <[email protected]> wrote:
>
> From: Shashank Sharma <[email protected]>
>
> Add a sysfs event to notify userspace about GPU resets providing:
> - PID that triggered the GPU reset, if any. Resets can happen from
> kernel threads as well, in that case no PID is provided
> - Information about the reset (e.g. was VRAM lost?)
>
> Co-developed-by: André Almeida <[email protected]>
> Signed-off-by: André Almeida <[email protected]>
> Signed-off-by: Shashank Sharma <[email protected]>
> ---
>
> V3:
> - Reduce information to just PID and flags
> - Use pid pointer instead of just pid number
> - BUG() if no reset info is provided
>
> V2:
> - Addressed review comments from Christian and Amar
> - move the reset information structure to DRM layer
> - drop _ctx from struct name
> - make pid 32 bit(than 64)
> - set flag when VRAM invalid (than valid)
> - add process name as well (Amar)
> ---
> drivers/gpu/drm/drm_sysfs.c | 26 ++++++++++++++++++++++++++
> include/drm/drm_sysfs.h | 13 +++++++++++++
> 2 files changed, 39 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..85777abf4194 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,32 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
> }
> EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info)
> +{
> + unsigned char pid_str[13];
> + unsigned char flags_str[18];
> + unsigned char reset_str[] = "RESET=1";
> + char *envp[] = { reset_str, pid_str, flags_str, NULL };
> +
> + DRM_DEBUG("generating reset event\n");
> +
> + BUG_ON(!reset_info);
> +
> + snprintf(pid_str, sizeof(pid_str), "PID=%u", pid_vnr(reset_info->pid));
> + snprintf(flags_str, sizeof(flags_str), "FLAGS=0x%llx", reset_info->flags);
> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
> /**
> * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
> * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..dbb0ac6230b8 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -2,15 +2,28 @@
> #ifndef _DRM_SYSFS_H_
> #define _DRM_SYSFS_H_
>
> +#define DRM_RESET_EVENT_VRAM_LOST (1 << 0)

I was thinking about this a bit more last night, and I think we should add:
DRM_RESET_EVENT_APP_ROBUSTNESS
When an application that supports robustness extensions starts, the
UMD can set a flag when it creates the context with the KMD. That way
if the app causes a GPU hang, the reset daemon would see this flag if
the guilty app supports robustness and adjust it's behavior as
appropriate. E.g., rather than killing the app, it might let it run
or set some grace period, etc.

Alex

> +
> struct drm_device;
> struct device;
> struct drm_connector;
> struct drm_property;
>
> +/**
> + * struct drm_reset_event_info - Information about a GPU reset event
> + * @pid: Process that triggered the reset, if any
> + * @flags: Extra information around the reset event (e.g. is VRAM lost?)
> + */
> +struct drm_reset_event_info {
> + struct pid *pid;
> + uint64_t flags;
> +};
> +
> int drm_class_device_register(struct device *dev);
> void drm_class_device_unregister(struct device *dev);
>
> void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event_info *reset_info);
> void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
> void drm_sysfs_connector_status_event(struct drm_connector *connector,
> struct drm_property *property);
> --
> 2.38.1
>

2022-11-30 11:30:28

by Daniel Vetter

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On Fri, Nov 25, 2022 at 02:52:01PM -0300, Andr? Almeida wrote:
> This patchset adds a udev event for DRM device's resets.
>
> Userspace apps can trigger GPU resets by misuse of graphical APIs or driver
> bugs. Either way, the GPU reset might lead the system to a broken state[1], that
> might be recovered if user has access to a tty or a remote shell. Arguably, this
> recovery could happen automatically by the system itself, thus this is the goal
> of this patchset.
>
> For debugging and report purposes, device coredump support was already added
> for amdgpu[2], but it's not suitable for programmatic usage like this one given
> the uAPI not being stable and the need for parsing.
>
> GL/VK is out of scope for this use, giving that we are dealing with device
> resets regardless of API.
>
> A basic userspace daemon is provided at [3] showing how the interface is used
> to recovery from resets.
>
> [1] A search for "reset" in DRM/AMD issue tracker shows reports of resets
> making the system unusable:
> https://gitlab.freedesktop.org/drm/amd/-/issues/?search=reset
>
> [2] https://lore.kernel.org/amd-gfx/[email protected]/
>
> [3] https://gitlab.freedesktop.org/andrealmeid/gpu-resetd
>
> v2: https://lore.kernel.org/dri-devel/[email protected]/
>
> Andr? Almeida (1):
> drm/amdgpu: Add work function for GPU reset event
>
> Shashank Sharma (1):
> drm: Add GPU reset sysfs event

This seems a bit much amd specific, and a bit much like an ad-hoc stopgap.

On the amd specific piece:

- amd's gpus suck the most for gpu hangs, because aside from the shader
unblock, there's only device reset, which thrashes vram and display and
absolutely everything. Which is terrible. Everyone else has engine only
reset since years (i.e. doesn't thrash display or vram), and very often
even just context reset (i.e. unless the driver is busted somehow or hw
bug, malicious userspace will _only_ ever impact itself).

- robustness extensions for gl/vk already have very clear specifications
of all cases of reset, and this work here just ignores that. Yes on amd
you only have device reset, but this is drm infra, so you need to be
able to cope with ctx reset or reset which only affected a limited set
of context. If this is for compute and compute apis lack robustness
extensions, then those apis need to be fixed to fill that gap.

- the entire deamon thing feels a bit like overkill and I'm not sure why
it exists. I think for a start it would be much simpler if we just have
a (per-device maybe) sysfs knob to enable automatic killing of process
that die and which don't have arb robustness enabled (for gl case, for
vk case the assumption is that _every_ app supports VK_DEVICE_LOST and
can recover).

Now onto the ad-hoc part:

- Everyone hand-rolls ad-hoc gpu context structures and how to associate
them with a pid. I think we need to stop doing that, because it's just
endless pain and prevents us from building useful management stuff like
cgroups for drivers that work across drivers (and driver/vendor specific
cgroup wont be accepted by upstream cgroup maintainers). Or gpu reset
events and dumps like here. This is going to be some work unforutnately.

- I think the best starting point is the context structure drm/scheduler
already has, but it needs some work:
* untangling it from the scheduler part, so it can be used also for
compute context that are directly scheduled by hw
* (amd specific) moving amdkfd over to that context structure, at least
internally
* tracking the pid in there

- I think the error dump facility should also be integrated into this.
Userspace needs to know which dump is associated with which reset event,
so that remote crash reporting works correctly.

- ideally this framework can keep track of impacted context so that
drivers don't have to reinvent the "which context are impacted"
robustness ioctl book-keeping all on their own. For amd gpus it's kinda
easy, since the impact is "everything", but for other gpus the impact
can be all the way from "only one context" to "only contexts actively
running on $set_of_engines" to "all the context actively running" to "we
thrashed vram, everything is gone"

- i915 has a bunch of this already, but I have honestly no idea whether
it's any use because i915-gem is terminally not switching over to
drm/scheduler (it needs a full rewrite, which is happening somewhere).
So might only be useful to look at to make sure we're not building
something which only works for full device reset gpus and nothing else.
Over the various generations i915 has pretty much every possible gpu
reset options you can think of, with resulting different reporting
requirements to make sure robustness extensions work correctly.

- pid isn't enough once you have engine/context reset, you need pid (well
drm_file really, but I guess we can bind those to pid somehow) and gpu
ctx id. Both gl and vk allow you to allocate limitless gpu context on
the same device, and so this matters.

- igt for this stuff. Probably needs some work to generalize the i915
infra for endless batchbuffers so that you can make very controlled gpu
hangs.

Cheers, Daniel

> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
> drivers/gpu/drm/drm_sysfs.c | 26 +++++++++++++++++++
> include/drm/drm_sysfs.h | 13 ++++++++++
> 4 files changed, 73 insertions(+)
>
> --
> 2.38.1
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2022-11-30 15:48:27

by André Almeida

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On 11/28/22 06:30, Simon Ser wrote:
> The PID is racy, the user-space daemon could end up killing an
> unrelated process… Is there any way we could use a pidfd instead?

Is the PID race condition something that really happens or rather
something theoretical?

Anyway, I can't see how pidfd and uevent would work together. Since
uevent it's kind of a broadcast and pidfd is an anon file, it wouldn't
be possible to say to userspace which is the fd to be used giving that
file descriptors are per process resources.

On the other hand, this interface could be converted to be an ioctl that
userspace would block waiting for a reset notification, then the kernel
could create a pidfd and give to the blocked process the right fd. We
would probably need a queue to make sure no event is lost.

Thanks
André

2022-11-30 15:49:12

by Simon Ser

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On Wednesday, November 30th, 2022 at 16:23, André Almeida <[email protected]> wrote:

> On 11/28/22 06:30, Simon Ser wrote:
>
> > The PID is racy, the user-space daemon could end up killing an
> > unrelated process… Is there any way we could use a pidfd instead?
>
> Is the PID race condition something that really happens or rather
> something theoretical?

A PID race can happen in practice if many PIDs get spawned. On Linux
PIDs wrap around pretty quickly.

Note, even a sandboxed program inside its own PID namespace can trigger
the wrap-around.

> Anyway, I can't see how pidfd and uevent would work together. Since
> uevent it's kind of a broadcast and pidfd is an anon file, it wouldn't
> be possible to say to userspace which is the fd to be used giving that
> file descriptors are per process resources.

Yeah, I can see how this can be difficult to integrate with uevent.

> On the other hand, this interface could be converted to be an ioctl that
> userspace would block waiting for a reset notification, then the kernel
> could create a pidfd and give to the blocked process the right fd. We
> would probably need a queue to make sure no event is lost.

A blocking IOCTL wouldn't be very nice, you can't integrate that into
an event loop for instance…

2022-12-08 05:39:18

by Alex Deucher

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On Wed, Nov 30, 2022 at 6:11 AM Daniel Vetter <[email protected]> wrote:
>
> On Fri, Nov 25, 2022 at 02:52:01PM -0300, André Almeida wrote:
> > This patchset adds a udev event for DRM device's resets.
> >
> > Userspace apps can trigger GPU resets by misuse of graphical APIs or driver
> > bugs. Either way, the GPU reset might lead the system to a broken state[1], that
> > might be recovered if user has access to a tty or a remote shell. Arguably, this
> > recovery could happen automatically by the system itself, thus this is the goal
> > of this patchset.
> >
> > For debugging and report purposes, device coredump support was already added
> > for amdgpu[2], but it's not suitable for programmatic usage like this one given
> > the uAPI not being stable and the need for parsing.
> >
> > GL/VK is out of scope for this use, giving that we are dealing with device
> > resets regardless of API.
> >
> > A basic userspace daemon is provided at [3] showing how the interface is used
> > to recovery from resets.
> >
> > [1] A search for "reset" in DRM/AMD issue tracker shows reports of resets
> > making the system unusable:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/?search=reset
> >
> > [2] https://lore.kernel.org/amd-gfx/[email protected]/
> >
> > [3] https://gitlab.freedesktop.org/andrealmeid/gpu-resetd
> >
> > v2: https://lore.kernel.org/dri-devel/[email protected]/
> >
> > André Almeida (1):
> > drm/amdgpu: Add work function for GPU reset event
> >
> > Shashank Sharma (1):
> > drm: Add GPU reset sysfs event
>
> This seems a bit much amd specific, and a bit much like an ad-hoc stopgap.
>
> On the amd specific piece:
>
> - amd's gpus suck the most for gpu hangs, because aside from the shader
> unblock, there's only device reset, which thrashes vram and display and
> absolutely everything. Which is terrible. Everyone else has engine only
> reset since years (i.e. doesn't thrash display or vram), and very often
> even just context reset (i.e. unless the driver is busted somehow or hw
> bug, malicious userspace will _only_ ever impact itself).
>
> - robustness extensions for gl/vk already have very clear specifications
> of all cases of reset, and this work here just ignores that. Yes on amd
> you only have device reset, but this is drm infra, so you need to be
> able to cope with ctx reset or reset which only affected a limited set
> of context. If this is for compute and compute apis lack robustness
> extensions, then those apis need to be fixed to fill that gap.
>
> - the entire deamon thing feels a bit like overkill and I'm not sure why
> it exists. I think for a start it would be much simpler if we just have
> a (per-device maybe) sysfs knob to enable automatic killing of process
> that die and which don't have arb robustness enabled (for gl case, for
> vk case the assumption is that _every_ app supports VK_DEVICE_LOST and
> can recover).

Thinking about this a bit more, I think there are useful cases for the
GPU reset event and a daemon. When I refer to a daemon here, it could
be a standalone thing or integrated into the desktop manager like
logind or whatever.
1. For APIs that don't have robustness support (e.g., video
encode/decode APIs). This one I could kind of go either way on since,
it probably makes sense to just kill the app if it there is no
robustness mechanism in the API.
2. Telemetry collection. It would be useful to have a central place
to collect telemetry information about what apps seem to be
problematic, etc.
3. A policy manager in userspace. If you want to make some decision
about what to do about repeat offenders or apps that claim to support
robustness, but really don't.
4. Apps that don't use a UMD. E.g., unit tests and IGT. If they
don't use a UMD, who kills them?
5. Server use cases where you have multiple GPU apps running in
containers and you want some sort of policy control or a hand in what
to do when the app causes a hang.

Alex

>
> Now onto the ad-hoc part:
>
> - Everyone hand-rolls ad-hoc gpu context structures and how to associate
> them with a pid. I think we need to stop doing that, because it's just
> endless pain and prevents us from building useful management stuff like
> cgroups for drivers that work across drivers (and driver/vendor specific
> cgroup wont be accepted by upstream cgroup maintainers). Or gpu reset
> events and dumps like here. This is going to be some work unforutnately.
>
> - I think the best starting point is the context structure drm/scheduler
> already has, but it needs some work:
> * untangling it from the scheduler part, so it can be used also for
> compute context that are directly scheduled by hw
> * (amd specific) moving amdkfd over to that context structure, at least
> internally
> * tracking the pid in there
>
> - I think the error dump facility should also be integrated into this.
> Userspace needs to know which dump is associated with which reset event,
> so that remote crash reporting works correctly.
>
> - ideally this framework can keep track of impacted context so that
> drivers don't have to reinvent the "which context are impacted"
> robustness ioctl book-keeping all on their own. For amd gpus it's kinda
> easy, since the impact is "everything", but for other gpus the impact
> can be all the way from "only one context" to "only contexts actively
> running on $set_of_engines" to "all the context actively running" to "we
> thrashed vram, everything is gone"
>
> - i915 has a bunch of this already, but I have honestly no idea whether
> it's any use because i915-gem is terminally not switching over to
> drm/scheduler (it needs a full rewrite, which is happening somewhere).
> So might only be useful to look at to make sure we're not building
> something which only works for full device reset gpus and nothing else.
> Over the various generations i915 has pretty much every possible gpu
> reset options you can think of, with resulting different reporting
> requirements to make sure robustness extensions work correctly.
>
> - pid isn't enough once you have engine/context reset, you need pid (well
> drm_file really, but I guess we can bind those to pid somehow) and gpu
> ctx id. Both gl and vk allow you to allocate limitless gpu context on
> the same device, and so this matters.
>
> - igt for this stuff. Probably needs some work to generalize the i915
> infra for endless batchbuffers so that you can make very controlled gpu
> hangs.
>
> Cheers, Daniel
>
> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
> > drivers/gpu/drm/drm_sysfs.c | 26 +++++++++++++++++++
> > include/drm/drm_sysfs.h | 13 ++++++++++
> > 4 files changed, 73 insertions(+)
> >
> > --
> > 2.38.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

2023-01-05 12:44:32

by Daniel Vetter

[permalink] [raw]

Subject: Re: [PATCH v3 0/2] drm: Add GPU reset sysfs

On Thu, 8 Dec 2022 at 05:54, Alex Deucher <[email protected]> wrote:
>
> On Wed, Nov 30, 2022 at 6:11 AM Daniel Vetter <[email protected]> wrote:
> >
> > On Fri, Nov 25, 2022 at 02:52:01PM -0300, André Almeida wrote:
> > > This patchset adds a udev event for DRM device's resets.
> > >
> > > Userspace apps can trigger GPU resets by misuse of graphical APIs or driver
> > > bugs. Either way, the GPU reset might lead the system to a broken state[1], that
> > > might be recovered if user has access to a tty or a remote shell. Arguably, this
> > > recovery could happen automatically by the system itself, thus this is the goal
> > > of this patchset.
> > >
> > > For debugging and report purposes, device coredump support was already added
> > > for amdgpu[2], but it's not suitable for programmatic usage like this one given
> > > the uAPI not being stable and the need for parsing.
> > >
> > > GL/VK is out of scope for this use, giving that we are dealing with device
> > > resets regardless of API.
> > >
> > > A basic userspace daemon is provided at [3] showing how the interface is used
> > > to recovery from resets.
> > >
> > > [1] A search for "reset" in DRM/AMD issue tracker shows reports of resets
> > > making the system unusable:
> > > https://gitlab.freedesktop.org/drm/amd/-/issues/?search=reset
> > >
> > > [2] https://lore.kernel.org/amd-gfx/[email protected]/
> > >
> > > [3] https://gitlab.freedesktop.org/andrealmeid/gpu-resetd
> > >
> > > v2: https://lore.kernel.org/dri-devel/[email protected]/
> > >
> > > André Almeida (1):
> > > drm/amdgpu: Add work function for GPU reset event
> > >
> > > Shashank Sharma (1):
> > > drm: Add GPU reset sysfs event
> >
> > This seems a bit much amd specific, and a bit much like an ad-hoc stopgap.
> >
> > On the amd specific piece:
> >
> > - amd's gpus suck the most for gpu hangs, because aside from the shader
> > unblock, there's only device reset, which thrashes vram and display and
> > absolutely everything. Which is terrible. Everyone else has engine only
> > reset since years (i.e. doesn't thrash display or vram), and very often
> > even just context reset (i.e. unless the driver is busted somehow or hw
> > bug, malicious userspace will _only_ ever impact itself).
> >
> > - robustness extensions for gl/vk already have very clear specifications
> > of all cases of reset, and this work here just ignores that. Yes on amd
> > you only have device reset, but this is drm infra, so you need to be
> > able to cope with ctx reset or reset which only affected a limited set
> > of context. If this is for compute and compute apis lack robustness
> > extensions, then those apis need to be fixed to fill that gap.
> >
> > - the entire deamon thing feels a bit like overkill and I'm not sure why
> > it exists. I think for a start it would be much simpler if we just have
> > a (per-device maybe) sysfs knob to enable automatic killing of process
> > that die and which don't have arb robustness enabled (for gl case, for
> > vk case the assumption is that _every_ app supports VK_DEVICE_LOST and
> > can recover).
>
> Thinking about this a bit more, I think there are useful cases for the
> GPU reset event and a daemon. When I refer to a daemon here, it could
> be a standalone thing or integrated into the desktop manager like
> logind or whatever.
> 1. For APIs that don't have robustness support (e.g., video
> encode/decode APIs). This one I could kind of go either way on since,
> it probably makes sense to just kill the app if it there is no
> robustness mechanism in the API.

I think transcode might also be a case where the userspace driver can
recover, at least on the decode side. But that would most likely
require some extension to make it clear to the app what's going on.

Or people just use vk video and be done, reset support comes built-in there :-)

> 2. Telemetry collection. It would be useful to have a central place
> to collect telemetry information about what apps seem to be
> problematic, etc.

Yeah I think standardizing reset messages and maybe device state dumps
makes sense. But that's telemetry, not making decisions about what to
kill.

> 3. A policy manager in userspace. If you want to make some decision
> about what to do about repeat offenders or apps that claim to support
> robustness, but really don't.

Imo we should have something for this in the kernel first. Kinda like
oom killer vs userspace oom killer. Sure eventually a userspace one
makes sense for very specific. But for a baseline I think we need an
in-kernel gpu offender killer first that's a bit standardized across
the drivers. Otherwise we're just guarnateed to build the wrong uapi,
with in-kernel first solution we can experiment around a bit first.

> 4. Apps that don't use a UMD. E.g., unit tests and IGT. If they
> don't use a UMD, who kills them?

CI framework. They have to deal with outright machine hangs and fun
stuff like that too unfortunately.

> 5. Server use cases where you have multiple GPU apps running in
> containers and you want some sort of policy control or a hand in what
> to do when the app causes a hang.

Yeah also a good idea, but first I think we need some kind of cgroups
that works across drivers. Then maybe the kill/permanent block support
can be integrated there somehow. I don't think just reset kill alone
makes any sense, because you're only stopping one abuse vector and the
entire dam is still totally broken.

tldr; I'm not against reset daemon on principle, I just think we're
not even close to a point where it makes sense to add something like
that and bake in the entire uapi with it.
-Daniel

>
> Alex
>
> >
> > Now onto the ad-hoc part:
> >
> > - Everyone hand-rolls ad-hoc gpu context structures and how to associate
> > them with a pid. I think we need to stop doing that, because it's just
> > endless pain and prevents us from building useful management stuff like
> > cgroups for drivers that work across drivers (and driver/vendor specific
> > cgroup wont be accepted by upstream cgroup maintainers). Or gpu reset
> > events and dumps like here. This is going to be some work unforutnately.
> >
> > - I think the best starting point is the context structure drm/scheduler
> > already has, but it needs some work:
> > * untangling it from the scheduler part, so it can be used also for
> > compute context that are directly scheduled by hw
> > * (amd specific) moving amdkfd over to that context structure, at least
> > internally
> > * tracking the pid in there
> >
> > - I think the error dump facility should also be integrated into this.
> > Userspace needs to know which dump is associated with which reset event,
> > so that remote crash reporting works correctly.
> >
> > - ideally this framework can keep track of impacted context so that
> > drivers don't have to reinvent the "which context are impacted"
> > robustness ioctl book-keeping all on their own. For amd gpus it's kinda
> > easy, since the impact is "everything", but for other gpus the impact
> > can be all the way from "only one context" to "only contexts actively
> > running on $set_of_engines" to "all the context actively running" to "we
> > thrashed vram, everything is gone"
> >
> > - i915 has a bunch of this already, but I have honestly no idea whether
> > it's any use because i915-gem is terminally not switching over to
> > drm/scheduler (it needs a full rewrite, which is happening somewhere).
> > So might only be useful to look at to make sure we're not building
> > something which only works for full device reset gpus and nothing else.
> > Over the various generations i915 has pretty much every possible gpu
> > reset options you can think of, with resulting different reporting
> > requirements to make sure robustness extensions work correctly.
> >
> > - pid isn't enough once you have engine/context reset, you need pid (well
> > drm_file really, but I guess we can bind those to pid somehow) and gpu
> > ctx id. Both gl and vk allow you to allocate limitless gpu context on
> > the same device, and so this matters.
> >
> > - igt for this stuff. Probably needs some work to generalize the i915
> > infra for endless batchbuffers so that you can make very controlled gpu
> > hangs.
> >
> > Cheers, Daniel
> >
> > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 30 ++++++++++++++++++++++
> > > drivers/gpu/drm/drm_sysfs.c | 26 +++++++++++++++++++
> > > include/drm/drm_sysfs.h | 13 ++++++++++
> > > 4 files changed, 73 insertions(+)
> > >
> > > --
> > > 2.38.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch