2023-06-27 13:27:35

by André Almeida

[permalink] [raw]
Subject: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen <[email protected]>
Signed-off-by: André Almeida <[email protected]>
---

v4: https://lore.kernel.org/lkml/[email protected]/

Changes:
- Grammar fixes (Randy)

Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
mmapped regular files. Threads cause additional pain with signal
handling as well.

+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+sections describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset stats for an specific context. This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if the UMD requires it. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriate API error code, as explained in the section below about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. If it is possible to
+determine that robustness is not in use, the UMD will terminate the app when a
+reset is detected, giving that the contexts are lost and the app won't be able
+to figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--------------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
.. _drm_driver_ioctl:

IOCTL Support on Device Nodes
--
2.41.0



2023-06-27 16:56:11

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Hi André,

I have just a few more below:

On 6/27/23 06:23, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <[email protected]>
> Signed-off-by: André Almeida <[email protected]>
> ---
>
> v4: https://lore.kernel.org/lkml/[email protected]/
>
> Changes:
> - Grammar fixes (Randy)
>
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a

section

> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of

for a specific

stats or status?

> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant

the first place.

> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes

and with those addressed:

Reviewed-by: Randy Dunlap <[email protected]>

Thanks for adding the documentation.

--
~Randy

2023-06-27 18:11:44

by Christian König

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Am 27.06.23 um 15:23 schrieb André Almeida:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <[email protected]>
> Signed-off-by: André Almeida <[email protected]>
> ---
>
> v4: https://lore.kernel.org/lkml/[email protected]/
>
> Changes:
> - Grammar fixes (Randy)
>
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context.

Maybe drop the part "for a specific context". Essentially the reset
query could use global counters instead and we won't need the context
any more here.

Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.

> This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes


2023-06-27 21:33:44

by André Almeida

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Em 27/06/2023 14:47, Christian König escreveu:
> Am 27.06.23 um 15:23 schrieb André Almeida:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Acked-by: Pekka Paalanen <[email protected]>
>> Signed-off-by: André Almeida <[email protected]>
>> ---
>>
>> v4:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Changes:
>>   - Grammar fixes (Randy)
>>
>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 68 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst
>> b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3cbffa25ed93 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a
>> third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware
>> bugs,
>> +faulty applications and everything in between the many layers. Some
>> errors
>> +require resetting the device in order to make the device usable
>> again. This
>> +sections describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and
>> to perform
>> +it as needed. Usually a hang is detected when a job gets stuck
>> executing. KMD
>> +should keep track of resets, because userspace can query any time
>> about the
>> +reset stats for an specific context.
>
> Maybe drop the part "for a specific context". Essentially the reset
> query could use global counters instead and we won't need the context
> any more here.
>

Right, I wrote like this to reflect how it's currently implemented.

If follow correctly what you meant, KMD could always notify the global
count for UMD, and we would move to the UMD the responsibility to manage
the reset counters, right? This would also simplify my
DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc
version.

> Apart from that this sounds good to me, feel free to add my rb.
>
> Regards,
> Christian.
>
>

2023-06-27 21:47:19

by André Almeida

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Hi Marek,

Em 27/06/2023 15:57, Marek Olšák escreveu:
> On Tue, Jun 27, 2023, 09:23 André Almeida <[email protected]
> <mailto:[email protected]>> wrote:
>
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if
> the device has
> +been reset, and this can be checked more often if the UMD requires
> it. After
> +detecting a reset, UMD will then proceed to report it to the
> application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>
>
> The UMD won't check the device status before every command submission
> due to ioctl overhead. Instead, the KMD should skip command submission
> and return an error that it was skipped.

I wrote like this because when reading the source code for
vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
the impression that UMD checks the reset status before every
submission/flush.

Is your comment about of how things are currently implemented, or how
they would ideally work? Either way I can apply your suggestion, I just
want to make it clear.

[0]
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
[1]
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83

>
> The only case where that won't be applicable is user queues where
> drivers don't call into the kernel to submit work, but they do call into
> the kernel to create a dma_fence. In that case, the call to create a
> dma_fence can fail with an error.
>
> Marek


2023-06-29 13:20:42

by André Almeida

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations



Em 27/06/2023 18:17, André Almeida escreveu:
> Em 27/06/2023 14:47, Christian König escreveu:
>> Am 27.06.23 um 15:23 schrieb André Almeida:
>>> Create a section that specifies how to deal with DRM device resets for
>>> kernel and userspace drivers.
>>>
>>> Acked-by: Pekka Paalanen <[email protected]>
>>> Signed-off-by: André Almeida <[email protected]>
>>> ---
>>>
>>> v4:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Changes:
>>>   - Grammar fixes (Randy)
>>>
>>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>>   1 file changed, 68 insertions(+)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst
>>> b/Documentation/gpu/drm-uapi.rst
>>> index 65fb3036a580..3cbffa25ed93 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a
>>> third handler for
>>>   mmapped regular files. Threads cause additional pain with signal
>>>   handling as well.
>>> +Device reset
>>> +============
>>> +
>>> +The GPU stack is really complex and is prone to errors, from
>>> hardware bugs,
>>> +faulty applications and everything in between the many layers. Some
>>> errors
>>> +require resetting the device in order to make the device usable
>>> again. This
>>> +sections describes the expectations for DRM and usermode drivers when a
>>> +device resets and how to propagate the reset status.
>>> +
>>> +Kernel Mode Driver
>>> +------------------
>>> +
>>> +The KMD is responsible for checking if the device needs a reset, and
>>> to perform
>>> +it as needed. Usually a hang is detected when a job gets stuck
>>> executing. KMD
>>> +should keep track of resets, because userspace can query any time
>>> about the
>>> +reset stats for an specific context.
>>
>> Maybe drop the part "for a specific context". Essentially the reset
>> query could use global counters instead and we won't need the context
>> any more here.
>>
>
> Right, I wrote like this to reflect how it's currently implemented.
>
> If follow correctly what you meant, KMD could always notify the global
> count for UMD, and we would move to the UMD the responsibility to manage
> the reset counters, right? This would also simplify my
> DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc
> version.
>

Actually, if we drop the context identifier we would lose the ability to
track which is the guilty context. Vulkan API doesn't seem to care about
this, but OpenGL does.

>> Apart from that this sounds good to me, feel free to add my rb.
>>
>> Regards,
>> Christian.
>>
>>

2023-06-30 14:57:10

by Sebastian Wick

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <[email protected]>
> Signed-off-by: André Almeida <[email protected]>
> ---
>
> v4: https://lore.kernel.org/lkml/[email protected]/
>
> Changes:
> - Grammar fixes (Randy)
>
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.

I still don't think it's good to let the kernel arbitrarily kill
processes that it thinks are not well-behaved based on some heuristics
and policy.

Can't this be outsourced to user space? Expose the information about
processes causing a device and let e.g. systemd deal with coming up
with a policy and with killing stuff.

> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
> --
> 2.41.0
>


2023-06-30 15:16:23

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
<[email protected]> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <[email protected]>
> > Signed-off-by: André Almeida <[email protected]>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/[email protected]/
> >
> > Changes:
> > - Grammar fixes (Randy)
> >
> > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > mmapped regular files. Threads cause additional pain with signal
> > handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> > .. _drm_driver_ioctl:
> >
> > IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>

2023-06-30 15:18:18

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 6/30/23 16:59, Alex Deucher wrote:
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <[email protected]> wrote:
>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
>>>
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>
>> I still don't think it's good to let the kernel arbitrarily kill
>> processes that it thinks are not well-behaved based on some heuristics
>> and policy.
>>
>> Can't this be outsourced to user space? Expose the information about
>> processes causing a device and let e.g. systemd deal with coming up
>> with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.


[0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".

--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-06-30 15:44:28

by Sebastian Wick

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher <[email protected]> wrote:
>
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <[email protected]> wrote:
> >
> > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
> > >
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <[email protected]>
> > > Signed-off-by: André Almeida <[email protected]>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Changes:
> > > - Grammar fixes (Randy)
> > >
> > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > > 1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > > mmapped regular files. Threads cause additional pain with signal
> > > handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> >
> > I still don't think it's good to let the kernel arbitrarily kill
> > processes that it thinks are not well-behaved based on some heuristics
> > and policy.
> >
> > Can't this be outsourced to user space? Expose the information about
> > processes causing a device and let e.g. systemd deal with coming up
> > with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

Ah, right, completely skipped over the UMD part. That makes more sense.
>
> Alex
>
> >
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> > > +
> > > .. _drm_driver_ioctl:
> > >
> > > IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
>


2023-07-03 07:38:01

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 6/30/23 22:32, Marek Olšák wrote:
> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected] <mailto:[email protected]>> wrote:
>> On 6/30/23 16:59, Alex Deucher wrote:
>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>> +Robustness
>>>>> +----------
>>>>> +
>>>>> +The only way to try to keep an application working after a reset is if it
>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>> +
>>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>
>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>> processes that it thinks are not well-behaved based on some heuristics
>>>> and policy.
>>>>
>>>> Can't this be outsourced to user space? Expose the information about
>>>> processes causing a device and let e.g. systemd deal with coming up
>>>> with a policy and with killing stuff.
>>>
>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>> just call exit().
>>
>> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
>
> That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.

No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.


> - non-robust contexts: call exit(1) immediately, which is the best way to recover

That's not the UMD's call to make.


>> [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-03 08:56:57

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Mon, 3 Jul 2023 09:12:29 +0200
Michel Dänzer <[email protected]> wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected] <mailto:[email protected]>> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <[email protected] <mailto:[email protected]>> wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected] <mailto:[email protected]>> wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is if it
> >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>>> +there is no guarantee that the app will use such features correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>>> +the user interface from being correctly displayed. This should be done even if
> >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what
> >> is done until the application destroys the context with
> >> robustness. Calling exit() likely results in losing any unsaved
> >> work, whereas at least some applications might otherwise allow
> >> saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a
> > freeze. You might as well disable GPU recovery because the result
> > would be the same.
>
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
>
>
> > - non-robust contexts: call exit(1) immediately, which is the best
> > way to recover
>
> That's not the UMD's call to make.
>
>
> >> [0] Possibly accompanied by a one-time message to stderr along
> >> the lines of "GPU reset detected but robustness not enabled in
> >> context, ignoring OpenGL API calls".
>

Hi,

Michel does have a point. It's not just games and display servers that
use GPU, but productivity tools as well. They may have periodic
autosave in anticipation of crashes, but being able to do the final
save before quitting would be nice. UMD killing the process would be
new behaviour, right? Previously either application's GPU thread hangs
or various API calls return errors, but it didn't kill the process, did
it?

If an application freezes, that's "no problem"; the end user can just
continue using everything else. Alt-tab away etc. if the app was
fullscreen. I do that already with games on even Xorg.

If a display server freezes, that's a desktop-wide problem, but so is
killing it.

OTOH, if UMD really does need to terminate the process, then please do
it in a way that causes a crash report to be recorded. _exit() with an
error code is not it.


Thanks,
pq


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature

2023-07-03 15:26:22

by André Almeida

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations



Em 03/07/2023 05:49, Pekka Paalanen escreveu:
> On Mon, 3 Jul 2023 09:12:29 +0200
> Michel Dänzer <[email protected]> wrote:
>
>> On 6/30/23 22:32, Marek Olšák wrote:
>>> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected] <mailto:[email protected]>> wrote:
>>>> On 6/30/23 16:59, Alex Deucher wrote:
>>>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>> +Robustness
>>>>>>> +----------
>>>>>>> +
>>>>>>> +The only way to try to keep an application working after a reset is if it
>>>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>>>> +
>>>>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>>>
>>>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>>>> processes that it thinks are not well-behaved based on some heuristics
>>>>>> and policy.
>>>>>>
>>>>>> Can't this be outsourced to user space? Expose the information about
>>>>>> processes causing a device and let e.g. systemd deal with coming up
>>>>>> with a policy and with killing stuff.
>>>>>
>>>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>>>> just call exit().
>>>>
>>>> It would be safer to just ignore API calls[0], similarly to what
>>>> is done until the application destroys the context with
>>>> robustness. Calling exit() likely results in losing any unsaved
>>>> work, whereas at least some applications might otherwise allow
>>>> saving the work by other means.
>>>
>>> That's a terrible idea. Ignoring API calls would be identical to a
>>> freeze. You might as well disable GPU recovery because the result
>>> would be the same.
>>
>> No GPU recovery would affect everything using the GPU, whereas this
>> affects only non-robust applications.
>>
>>
>>> - non-robust contexts: call exit(1) immediately, which is the best
>>> way to recover
>>
>> That's not the UMD's call to make.
>>
>>
>>>> [0] Possibly accompanied by a one-time message to stderr along
>>>> the lines of "GPU reset detected but robustness not enabled in
>>>> context, ignoring OpenGL API calls".
>>
>
> Hi,
>
> Michel does have a point. It's not just games and display servers that
> use GPU, but productivity tools as well. They may have periodic
> autosave in anticipation of crashes, but being able to do the final
> save before quitting would be nice. UMD killing the process would be
> new behaviour, right? Previously either application's GPU thread hangs
> or various API calls return errors, but it didn't kill the process, did
> it?
>

In Intel's Iris, UMD may call abort() for the reset guilty application:

https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063

I was pretty sure this was the same for RadeonSI, but I failed to find
the code for this, so I might be wrong.

> If an application freezes, that's "no problem"; the end user can just
> continue using everything else. Alt-tab away etc. if the app was
> fullscreen. I do that already with games on even Xorg.
>
> If a display server freezes, that's a desktop-wide problem, but so is
> killing it.
>

Interesting, what GPU do you use? In my experience (AMD RX 5600 XT),
hanging the GPU usually means that the rest of applications/compositor
can't use the GPU either, freezing all user interactions. So killing the
guilty app is one effective solution currently, but ignoring calls may
help as well.

> OTOH, if UMD really does need to terminate the process, then please do
> it in a way that causes a crash report to be recorded. _exit() with an
> error code is not it.
>

In the "Reporting causes of resets" subsection of this document I can
add something for UMD as well.

>
> Thanks,
> pq

2023-07-04 02:56:26

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations



On 7/3/23 19:34, Marek Olšák wrote:
>
>
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <[email protected] <mailto:[email protected]>> wrote:
>

Marek,
Please stop sending html emails to the mailing lists.
The mailing list software drops them.

Please set your email interface to use plain text mode instead.
Thanks.

--
~Randy

2023-07-04 03:18:03

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations



On 7/3/23 19:44, Marek Olšák wrote:
>
>
> On Mon, Jul 3, 2023, 22:38 Randy Dunlap <[email protected] <mailto:[email protected]>> wrote:
>
>
>
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <[email protected] <mailto:[email protected]> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >
>
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
>
> Please set your email interface to use plain text mode instead.
> Thanks.
>
>
> The mobile Gmail app doesn't support plain text, which I use frequently.

Perhaps you should consider some other mobile app for kernel discussions.

E.g., it looks like the K-9 mail app works with gmail.

--
~Randy

2023-07-04 07:56:14

by Pekka Paalanen

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Mon, 3 Jul 2023 12:00:22 -0300
André Almeida <[email protected]> wrote:

> Em 03/07/2023 05:49, Pekka Paalanen escreveu:

> > If an application freezes, that's "no problem"; the end user can just
> > continue using everything else. Alt-tab away etc. if the app was
> > fullscreen. I do that already with games on even Xorg.
> >
> > If a display server freezes, that's a desktop-wide problem, but so is
> > killing it.
> >
>
> Interesting, what GPU do you use? In my experience (AMD RX 5600 XT),
> hanging the GPU usually means that the rest of applications/compositor
> can't use the GPU either, freezing all user interactions. So killing the
> guilty app is one effective solution currently, but ignoring calls may
> help as well.

I don't know if what I'm seeing is a GPU hang or just e.g. Proton
getting somehow stuck, all I see is a game freezing. I just Alt+tab
back to Steam, force-stop it, and then all is fine again. This is how
it should work regardless of why a game freezes.

However, even if it was a GPU hang, if I am on a display server that
actually handles GPU resets, I don't see why the rest of the desktop
would not be able to recover. Individual apps are each to their own,
but at the very least non-GPU apps and the DE itself should not have
any problem (DE components can simply be restarted automatically).


Thanks,
pq


Attachments:
(No filename) (849.00 B)
OpenPGP digital signature

2023-07-04 08:16:47

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 7/4/23 04:34, Marek Olšák wrote:
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <[email protected] <mailto:[email protected]>> wrote:
> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected] <mailto:[email protected]> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <[email protected] <mailto:[email protected]> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected] <mailto:[email protected]> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is if it
> >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>>> +there is no guarantee that the app will use such features correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>>> +the user interface from being correctly displayed. This should be done even if
> >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
>
> No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
>
> which is currently the majority.

Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.


> > - non-robust contexts: call exit(1) immediately, which is the best way to recover
>
> That's not the UMD's call to make.
>
> That's absolutely the UMD's call to make because that's mandated by the hw and API design

Can you point us to a spec which mandates that the process must be killed in this case?


> and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.

The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).


> >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-05 08:20:08

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer <[email protected]> wrote:
> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <[email protected] > wrote:
> >     On 6/30/23 22:32, Marek Olšák wrote:
> >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected]> wrote:
> >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >>> <[email protected] <mailto:[email protected]> wrote:
> >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
> >     >>>>>
> >     >>>>> +Robustness
> >     >>>>> +----------
> >     >>>>> +
> >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >>>>> +
> >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >>>>
> >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >>>> and policy.
> >     >>>>
> >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >>>> processes causing a device and let e.g. systemd deal with coming up
> >     >>>> with a policy and with killing stuff.
> >     >>>
> >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >>> just call exit().
> >     >>
> >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >     >
> >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> >
> >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> >
> >     That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the hw and API design
>
> Can you point us to a spec which mandates that the process must be killed in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
>
>
> You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,

I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.


> Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.

Then some appropriate Android component needs to make that call. The UMD is not it.


> >     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-05 16:23:42

by Marek Olšák

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <[email protected]> wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <[email protected]> wrote:
> > On 7/4/23 04:34, Marek Olšák wrote:
> > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <[email protected] > wrote:
> > > On 6/30/23 22:32, Marek Olšák wrote:
> > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <[email protected]> wrote:
> > > >> On 6/30/23 16:59, Alex Deucher wrote:
> > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > > >>> <[email protected] <mailto:[email protected]> wrote:
> > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <[email protected]> wrote:
> > > >>>>>
> > > >>>>> +Robustness
> > > >>>>> +----------
> > > >>>>> +
> > > >>>>> +The only way to try to keep an application working after a reset is if it
> > > >>>>> +complies with the robustness aspects of the graphical API that it is using.
> > > >>>>> +
> > > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> > > >>>>> +there is no guarantee that the app will use such features correctly, and the
> > > >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> > > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > >>>>> +the user interface from being correctly displayed. This should be done even if
> > > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> > > >>>>
> > > >>>> I still don't think it's good to let the kernel arbitrarily kill
> > > >>>> processes that it thinks are not well-behaved based on some heuristics
> > > >>>> and policy.
> > > >>>>
> > > >>>> Can't this be outsourced to user space? Expose the information about
> > > >>>> processes causing a device and let e.g. systemd deal with coming up
> > > >>>> with a policy and with killing stuff.
> > > >>>
> > > >>> I don't think it's the kernel doing the killing, it would be the UMD.
> > > >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> > > >>> just call exit().
> > > >>
> > > >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> > > >
> > > > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> > >
> > > No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> > >
> > > which is currently the majority.
> >
> > Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
> >
> >
> > > > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> > >
> > > That's not the UMD's call to make.
> > >
> > > That's absolutely the UMD's call to make because that's mandated by the hw and API design
> >
> > Can you point us to a spec which mandates that the process must be killed in this case?
> >
> >
> > > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
> >
> > The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.
>
>
> > Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.
>
> Then some appropriate Android component needs to make that call. The UMD is not it.

We can change it once Android and Linux have a better way to handle
non-robust apps. Until then, generating a core dump after a GPU crash
produces the best outcome for users and developers.

Marek

2023-07-25 02:58:37

by André Almeida

[permalink] [raw]
Subject: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

Hi everyone,

It's not clear what we should do about non-robust OpenGL apps after GPU
resets, so I'll try to summarize the topic, show some options and my
proposal to move forward on that.

Em 27/06/2023 10:23, André Almeida escreveu:
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
Depending on the OpenGL version, there are different robustness API
available:

- OpenGL ABR extension [0]
- OpenGL KHR extension [1]
- OpenGL ES extension [2]

Apps written in OpenGL should use whatever version is available for them
to make the app robust for GPU resets. That usually means calling
GetGraphicsResetStatusARB(), checking the status, and if it encounter
something different from NO_ERROR, that means that a reset has happened,
the context is considered lost and should be recreated. If an app follow
this, it will likely succeed recovering a reset.

What should non-robustness apps do then? They certainly will not be
notified if a reset happens, and thus can't recover if their context is
lost. OpenGL specification does not explicitly define what should be
done in such situations[3], and I believe that usually when the spec
mandates to close the app, it would explicitly note it.

However, in reality there are different types of device resets, causing
different results. A reset can be precise enough to damage only the
guilty context, and keep others alive.

Given that, I believe drivers have the following options:

a) Kill all non-robust apps after a reset. This may lead to lose work
from innocent applications.

b) Ignore all non-robust apps OpenGL calls. That means that applications
would still be alive, but the user interface would be freeze. The user
would need to close it manually anyway, but in some corner cases, the
app could autosave some work or the user might be able to interact with
it using some alternative method (command line?).

c) Kill just the affected non-robust applications. To do that, the
driver need to be 100% sure on the impact of its resets.

RadeonSI currently implements a), as can be seen at [4], while Iris
implements what I think it's c)[5].

For the user experience point-of-view, c) is clearly the best option,
but it's the hardest to archive. There's not much gain on having b) over
a), perhaps it could be an optional env var for such corner case
applications.

My proposal for the documentation is: implement a) if nothing else is
available, have a MESA_NO_RESET_KILL for people that want b), ideally
implement c) if the driver is able to know for sure that the non-guilty
apps can still work after a reset.

Thanks,
André

[0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
[1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
[2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
[3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
[4]
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
[5]
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

2023-07-25 07:27:37

by Simon Ser

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On Tuesday, July 25th, 2023 at 04:55, André Almeida <[email protected]> wrote:

> It's not clear what we should do about non-robust OpenGL apps after GPU
> resets, so I'll try to summarize the topic, show some options and my
> proposal to move forward on that.
>
> Em 27/06/2023 10:23, André Almeida escreveu:
>
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
>
> Depending on the OpenGL version, there are different robustness API
> available:
>
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension [2]
>
> Apps written in OpenGL should use whatever version is available for them
> to make the app robust for GPU resets. That usually means calling
> GetGraphicsResetStatusARB(), checking the status, and if it encounter
> something different from NO_ERROR, that means that a reset has happened,
> the context is considered lost and should be recreated. If an app follow
> this, it will likely succeed recovering a reset.
>
> What should non-robustness apps do then? They certainly will not be
> notified if a reset happens, and thus can't recover if their context is
> lost. OpenGL specification does not explicitly define what should be
> done in such situations[3], and I believe that usually when the spec
> mandates to close the app, it would explicitly note it.
>
> However, in reality there are different types of device resets, causing
> different results. A reset can be precise enough to damage only the
> guilty context, and keep others alive.
>
> Given that, I believe drivers have the following options:
>
> a) Kill all non-robust apps after a reset. This may lead to lose work
> from innocent applications.
>
> b) Ignore all non-robust apps OpenGL calls. That means that applications
> would still be alive, but the user interface would be freeze. The user
> would need to close it manually anyway, but in some corner cases, the
> app could autosave some work or the user might be able to interact with
> it using some alternative method (command line?).
>
> c) Kill just the affected non-robust applications. To do that, the
> driver need to be 100% sure on the impact of its resets.

We've discussed this a while back on #dri-devel IIRC. I think the best
experience would be for the Wayland compositor to gray out apps which
lost their GL context, and display an information dialog to explain
what happened to the user and a button to kill the app. I'm not exactly
sure how that would translate to a kernel or Mesa uAPI, and if there's
appetite to do a lot of work to get "the best GPU reset UX" (IOW: maybe
it's not worth all of the trouble).

2023-07-25 09:17:00

by Michel Dänzer

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On 7/25/23 04:55, André Almeida wrote:
> Hi everyone,
>
> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>
> Em 27/06/2023 10:23, André Almeida escreveu:
>> +Robustness
>> +----------
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that it is using.
>> +
>> +Graphical APIs provide ways to applications to deal with device resets. However,
>> +there is no guarantee that the app will use such features correctly, and the
>> +UMD can implement policies to close the app if it is a repeating offender,
>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>> +the user interface from being correctly displayed. This should be done even if
>> +the app is correct but happens to trigger some bug in the hardware/driver.
>> +
> Depending on the OpenGL version, there are different robustness API available:
>
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension  [2]
>
> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>
> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>
> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>
> Given that, I believe drivers have the following options:
>
> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>
> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>
> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>
> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>
> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.

I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.


> [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
> [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
> [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
> [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
> [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-25 13:08:59

by André Almeida

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

Hi Michel,

Em 25/07/2023 05:03, Michel Dänzer escreveu:
> On 7/25/23 04:55, André Almeida wrote:
>> Hi everyone,
>>
>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>
>> Em 27/06/2023 10:23, André Almeida escreveu:
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>> +
>> Depending on the OpenGL version, there are different robustness API available:
>>
>> - OpenGL ABR extension [0]
>> - OpenGL KHR extension [1]
>> - OpenGL ES extension  [2]
>>
>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>
>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>
>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>
>> Given that, I believe drivers have the following options:
>>
>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>
>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>
>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>
>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>
>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
>

Thank you for the feedback. How do you think the documentation should
look like for this part?

2023-07-25 16:12:00

by Marek Olšák

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
<[email protected]> wrote:
>
> On 7/25/23 04:55, André Almeida wrote:
> > Hi everyone,
> >
> > It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> >
> > Em 27/06/2023 10:23, André Almeida escreveu:
> >> +Robustness
> >> +----------
> >> +
> >> +The only way to try to keep an application working after a reset is if it
> >> +complies with the robustness aspects of the graphical API that it is using.
> >> +
> >> +Graphical APIs provide ways to applications to deal with device resets. However,
> >> +there is no guarantee that the app will use such features correctly, and the
> >> +UMD can implement policies to close the app if it is a repeating offender,
> >> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >> +the user interface from being correctly displayed. This should be done even if
> >> +the app is correct but happens to trigger some bug in the hardware/driver.
> >> +
> > Depending on the OpenGL version, there are different robustness API available:
> >
> > - OpenGL ABR extension [0]
> > - OpenGL KHR extension [1]
> > - OpenGL ES extension [2]
> >
> > Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> >
> > What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> >
> > However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> >
> > Given that, I believe drivers have the following options:
> >
> > a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> >
> > b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> >
> > c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> >
> > RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> >
> > For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.

That's not true. I recommend that you enable b) with your driver and
then hang the GPU under different scenarios and see the result. Then
enable a) and do the same and compare.

Options a) and c) can be merged into one because they are not separate
options to choose from.

If Wayland wanted to grey out lost apps, they would appear as robust
contexts in gallium, but the reset status would be piped through the
Wayland protocol instead of the GL API.

Marek



Marek

2023-07-25 18:13:33

by Michel Dänzer

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On 7/25/23 17:05, Marek Olšák wrote:
> On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> <[email protected]> wrote:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
>>>> +Robustness
>>>> +----------
>>>> +
>>>> +The only way to try to keep an application working after a reset is if it
>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>> +
>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>> +there is no guarantee that the app will use such features correctly, and the
>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>> +the user interface from being correctly displayed. This should be done even if
>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>> +
>>> Depending on the OpenGL version, there are different robustness API available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
>
> That's not true.

Which part of what I wrote are you referring to?


> I recommend that you enable b) with your driver and then hang the GPU under different scenarios and see the result.

I've been doing GPU driver development for over two decades, I'm perfectly aware what it means. It doesn't change what I wrote above.


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-26 08:55:18

by Michel Dänzer

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On 7/25/23 15:02, André Almeida wrote:
> Em 25/07/2023 05:03, Michel Dänzer escreveu:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
>>>> +Robustness
>>>> +----------
>>>> +
>>>> +The only way to try to keep an application working after a reset is if it
>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>> +
>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>> +there is no guarantee that the app will use such features correctly, and the
>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>> +the user interface from being correctly displayed. This should be done even if
>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>> +
>>> Depending on the OpenGL version, there are different robustness API available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension  [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
>>
>
> Thank you for the feedback. How do you think the documentation should look like for this part?

The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.


If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-07-26 08:59:22

by Timur Kristóf

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote:
> On 7/25/23 17:05, Marek Olšák wrote:
> > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> > <[email protected]> wrote:
> > > On 7/25/23 04:55, André Almeida wrote:
> > > > Hi everyone,
> > > >
> > > > It's not clear what we should do about non-robust OpenGL apps
> > > > after GPU resets, so I'll try to summarize the topic, show some
> > > > options and my proposal to move forward on that.
> > > >
> > > > Em 27/06/2023 10:23, André Almeida escreveu:
> > > > > +Robustness
> > > > > +----------
> > > > > +
> > > > > +The only way to try to keep an application working after a
> > > > > reset is if it
> > > > > +complies with the robustness aspects of the graphical API
> > > > > that it is using.
> > > > > +
> > > > > +Graphical APIs provide ways to applications to deal with
> > > > > device resets. However,
> > > > > +there is no guarantee that the app will use such features
> > > > > correctly, and the
> > > > > +UMD can implement policies to close the app if it is a
> > > > > repeating offender,
> > > > > +likely in a broken loop. This is done to ensure that it does
> > > > > not keep blocking
> > > > > +the user interface from being correctly displayed. This
> > > > > should be done even if
> > > > > +the app is correct but happens to trigger some bug in the
> > > > > hardware/driver.
> > > > > +
> > > > Depending on the OpenGL version, there are different robustness
> > > > API available:
> > > >
> > > > - OpenGL ABR extension [0]
> > > > - OpenGL KHR extension [1]
> > > > - OpenGL ES extension  [2]
> > > >
> > > > Apps written in OpenGL should use whatever version is available
> > > > for them to make the app robust for GPU resets. That usually
> > > > means calling GetGraphicsResetStatusARB(), checking the status,
> > > > and if it encounter something different from NO_ERROR, that
> > > > means that a reset has happened, the context is considered lost
> > > > and should be recreated. If an app follow this, it will likely
> > > > succeed recovering a reset.
> > > >
> > > > What should non-robustness apps do then? They certainly will
> > > > not be notified if a reset happens, and thus can't recover if
> > > > their context is lost. OpenGL specification does not explicitly
> > > > define what should be done in such situations[3], and I believe
> > > > that usually when the spec mandates to close the app, it would
> > > > explicitly note it.
> > > >
> > > > However, in reality there are different types of device resets,
> > > > causing different results. A reset can be precise enough to
> > > > damage only the guilty context, and keep others alive.
> > > >
> > > > Given that, I believe drivers have the following options:
> > > >
> > > > a) Kill all non-robust apps after a reset. This may lead to
> > > > lose work from innocent applications.
> > > >
> > > > b) Ignore all non-robust apps OpenGL calls. That means that
> > > > applications would still be alive, but the user interface would
> > > > be freeze. The user would need to close it manually anyway, but
> > > > in some corner cases, the app could autosave some work or the
> > > > user might be able to interact with it using some alternative
> > > > method (command line?).
> > > >
> > > > c) Kill just the affected non-robust applications. To do that,
> > > > the driver need to be 100% sure on the impact of its resets.
> > > >
> > > > RadeonSI currently implements a), as can be seen at [4], while
> > > > Iris implements what I think it's c)[5].
> > > >
> > > > For the user experience point-of-view, c) is clearly the best
> > > > option, but it's the hardest to archive. There's not much gain
> > > > on having b) over a), perhaps it could be an optional env var
> > > > for such corner case applications.
> > >
> > > I disagree on these conclusions.
> > >
> > > c) is certainly better than a), but it's not "clearly the best"
> > > in all cases. The OpenGL UMD is not a privileged/special
> > > component and is in no position to decide whether or not the
> > > process as a whole (only some thread(s) of which may use OpenGL
> > > at all) gets to continue running or not.
> >
> > That's not true.
>
> Which part of what I wrote are you referring to?
>
>
> > I recommend that you enable b) with your driver and then hang the
> > GPU under different scenarios and see the result.
>
> I've been doing GPU driver development for over two decades, I'm
> perfectly aware what it means. It doesn't change what I wrote above.
>

Michel, I understand that you disagree with the proposed solutions in
this email thread but from your mails it is unclear to me what exactly
is the solution that you would actually recommend, can you please
clarify?

Thanks,
Timur


2023-08-02 08:39:36

by Marek Olšák

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.


Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
<[email protected]> wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > Em 25/07/2023 05:03, Michel Dänzer escreveu:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> Hi everyone,
> >>>
> >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> >>>
> >>> Em 27/06/2023 10:23, André Almeida escreveu:
> >>>> +Robustness
> >>>> +----------
> >>>> +
> >>>> +The only way to try to keep an application working after a reset is if it
> >>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>> +
> >>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>> +there is no guarantee that the app will use such features correctly, and the
> >>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>> +the user interface from being correctly displayed. This should be done even if
> >>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >>>> +
> >>> Depending on the OpenGL version, there are different robustness API available:
> >>>
> >>> - OpenGL ABR extension [0]
> >>> - OpenGL KHR extension [1]
> >>> - OpenGL ES extension [2]
> >>>
> >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> >>>
> >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> >>>
> >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> >>>
> >>> Given that, I believe drivers have the following options:
> >>>
> >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> >>>
> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> >>>
> >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> >>>
> >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> >>>
> >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
> >>
> >> I disagree on these conclusions.
> >>
> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> >>
> >
> > Thank you for the feedback. How do you think the documentation should look like for this part?
>
> The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.
>
>
> If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.
>
>
> --
> Earthling Michel Dänzer | https://redhat.com
> Libre software enthusiast | Mesa and Xwayland developer
>

2023-08-02 09:41:30

by Michel Dänzer

[permalink] [raw]
Subject: Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

On 8/2/23 09:38, Marek Olšák wrote:
>
> The precedent from the CPU land is pretty strong here. There is
> SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
> instructions, yet we do nothing for invalid GPU memory access and
> invalid GPU instructions. Sending a terminating signal from the kernel
> would be the most natural thing to do.

After an unhandled SIGSEGV or SIGILL, the process is in an inconsistent state and cannot safely continue executing. That's why the process is terminated by default in those cases.

The same is not true when an OpenGL context stops working. Any threads / other parts of the process not using that OpenGL context continue working normally. And any attempts to use that OpenGL context can be safely ignored (or the OpenGL implementation couldn't support the robustness extensions).


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-08-04 13:36:15

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Tue, Jun 27, 2023 at 10:23:23AM -0300, Andr? Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <[email protected]>
> Signed-off-by: Andr? Almeida <[email protected]>
> ---
>
> v4: https://lore.kernel.org/lkml/[email protected]/
>
> Changes:
> - Grammar fixes (Randy)
>
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,

Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
offender killing is more a policy where the kernel enforces policy, and no
longer up to userspace to dtrt (because very clearly userspace is not
really doing the right thing anymore when it's just hanging the gpu in an
endless loop). Also maybe tune it down further to something like "the
kernel driver may implemnent ..."

In my opinion the umd shouldn't implement these kind of magic guesses, the
entire point of robustness apis is to delegate responsibility for
correctly recovering to the application. And the kernel is left with
enforcing fair resource usage policies (which eventually might be a
cgroups limit on how much gpu time you're allowed to waste with gpu
resets).

> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.

Since we do not seem to have a solid consensus in the community about
non-robust userspace, maybe we could just document that lack of consensus
to unblock this patch? Something like this:

Non-Robust Userspace
--------------------

Userspace that doesn't support robust interfaces (like an non-robust
OpenGL context or API without any robustness support like libva) leave the
robustness handling entirely to the userspace driver. There is no strong
community consensus on what the userspace driver should do in that case,
since all reasonable approaches have some clear downsides.

With the s/UMD/KMD/ further up and maybe something added to record the
non-robustness non-consensus:

Acked-by: Daniel Vetter <[email protected]>

Cheers, Daniel



> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
> --
> 2.41.0
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-08-08 18:12:18

by Sebastian Wick

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <[email protected]> wrote:
>
> On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <[email protected]>
> > Signed-off-by: André Almeida <[email protected]>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/[email protected]/
> >
> > Changes:
> > - Grammar fixes (Randy)
> >
> > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > mmapped regular files. Threads cause additional pain with signal
> > handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
>
> Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> offender killing is more a policy where the kernel enforces policy, and no
> longer up to userspace to dtrt (because very clearly userspace is not
> really doing the right thing anymore when it's just hanging the gpu in an
> endless loop). Also maybe tune it down further to something like "the
> kernel driver may implemnent ..."
>
> In my opinion the umd shouldn't implement these kind of magic guesses, the
> entire point of robustness apis is to delegate responsibility for
> correctly recovering to the application. And the kernel is left with
> enforcing fair resource usage policies (which eventually might be a
> cgroups limit on how much gpu time you're allowed to waste with gpu
> resets).

Killing apps that the kernel thinks are misbehaving really doesn't
seem like a good idea to me. What if the process is a service getting
restarted after getting killed? What if killing that process leaves
the system in a bad state?

Can't the kernel provide some information to user space so that e.g.
systemd can handle those situations?

> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
>
> Since we do not seem to have a solid consensus in the community about
> non-robust userspace, maybe we could just document that lack of consensus
> to unblock this patch? Something like this:
>
> Non-Robust Userspace
> --------------------
>
> Userspace that doesn't support robust interfaces (like an non-robust
> OpenGL context or API without any robustness support like libva) leave the
> robustness handling entirely to the userspace driver. There is no strong
> community consensus on what the userspace driver should do in that case,
> since all reasonable approaches have some clear downsides.
>
> With the s/UMD/KMD/ further up and maybe something added to record the
> non-robustness non-consensus:
>
> Acked-by: Daniel Vetter <[email protected]>
>
> Cheers, Daniel
>
>
>
> > +
> > .. _drm_driver_ioctl:
> >
> > IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


2023-08-08 18:59:36

by Marek Olšák

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.



Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick <[email protected]> wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <[email protected]> wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <[email protected]>
> > > Signed-off-by: André Almeida <[email protected]>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Changes:
> > > - Grammar fixes (Randy)
> > >
> > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > > 1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > > mmapped regular files. Threads cause additional pain with signal
> > > handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implemnent ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> >
> > Since we do not seem to have a solid consensus in the community about
> > non-robust userspace, maybe we could just document that lack of consensus
> > to unblock this patch? Something like this:
> >
> > Non-Robust Userspace
> > --------------------
> >
> > Userspace that doesn't support robust interfaces (like an non-robust
> > OpenGL context or API without any robustness support like libva) leave the
> > robustness handling entirely to the userspace driver. There is no strong
> > community consensus on what the userspace driver should do in that case,
> > since all reasonable approaches have some clear downsides.
> >
> > With the s/UMD/KMD/ further up and maybe something added to record the
> > non-robustness non-consensus:
> >
> > Acked-by: Daniel Vetter <[email protected]>
> >
> > Cheers, Daniel
> >
> >
> >
> > > +
> > > .. _drm_driver_ioctl:
> > >
> > > IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >
>

2023-08-09 08:47:32

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 8/8/23 19:03, Marek Olšák wrote:
> It's the same situation as SIGSEGV. A process can catch the signal,
> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> to catch the GPU error and prevent the process termination. If you
> don't use the API, you'll get undefined behavior, which means anything
> can happen, including process termination.

Got a spec reference for that?

I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:

If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
then the implementation will never deliver notification of reset
events, and GetGraphicsResetStatusARB will always return
NO_ERROR[fn1].
[fn1: In this case it is recommended that implementations should
not allow loss of context state no matter what events occur.
However, this is only a recommendation, and cannot be relied
upon by applications.]

No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2023-08-09 19:59:41

by Marek Olšák

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <[email protected]> wrote:
>
> On 8/8/23 19:03, Marek Olšák wrote:
> > It's the same situation as SIGSEGV. A process can catch the signal,
> > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> > to catch the GPU error and prevent the process termination. If you
> > don't use the API, you'll get undefined behavior, which means anything
> > can happen, including process termination.
>
> Got a spec reference for that?
>
> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>
> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
> then the implementation will never deliver notification of reset
> events, and GetGraphicsResetStatusARB will always return
> NO_ERROR[fn1].
> [fn1: In this case it is recommended that implementations should
> not allow loss of context state no matter what events occur.
> However, this is only a recommendation, and cannot be relied
> upon by applications.]
>
> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.

It basically says that we can do anything.

A frozen window or flipping between 2 random frames can't be described
as "keeping the application running". That's the worst user
experience. I will not accept it.

A window system can force-enable robustness for its non-robust apps
and control that. That's the best possible user experience and it's
achievable everywhere. Everything else doesn't matter.

Marek




Marek

2023-08-10 08:12:33

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On 8/9/23 21:15, Marek Olšák wrote:
> On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <[email protected]> wrote:
>> On 8/8/23 19:03, Marek Olšák wrote:
>>> It's the same situation as SIGSEGV. A process can catch the signal,
>>> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
>>> to catch the GPU error and prevent the process termination. If you
>>> don't use the API, you'll get undefined behavior, which means anything
>>> can happen, including process termination.
>>
>> Got a spec reference for that?
>>
>> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>>
>> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>> then the implementation will never deliver notification of reset
>> events, and GetGraphicsResetStatusARB will always return
>> NO_ERROR[fn1].
>> [fn1: In this case it is recommended that implementations should
>> not allow loss of context state no matter what events occur.
>> However, this is only a recommendation, and cannot be relied
>> upon by applications.]
>>
>> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.
>
> It basically says that we can do anything.

Not really? If program termination is a possible outcome, the spec otherwise mentions that explicitly, ala "including program termination".


> A frozen window or flipping between 2 random frames can't be described
> as "keeping the application running".

This assumes that an application which uses OpenGL cannot have any other purpose than using OpenGL.


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer