LinuxLists.cc - [Regression] drm/scheduler: track GPU active time per entity

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Hi Danilo,

Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> Hi all,
>
> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> tries to track the accumulated time that a job was active on the GPU
> writing it to the entity through which the job was deployed to the
> scheduler originally. This is done within drm_sched_get_cleanup_job()
> which fetches a job from the schedulers pending_list.
>
> Doing this can result in a race condition where the entity is already
> freed, but the entity's newly added elapsed_ns field is still accessed
> once the job is fetched from the pending_list.
>
> After drm_sched_entity_destroy() being called it should be safe to free
> the structure that embeds the entity. However, a job originally handed
> over to the scheduler by this entity might still reside in the
> schedulers pending_list for cleanup after drm_sched_entity_destroy()
> already being called and the entity being freed. Hence, we can run into
> a UAF.
>
Sorry about that, I clearly didn't properly consider this case.

> In my case it happened that a job, as explained above, was just picked
> from the schedulers pending_list after the entity was freed due to the
> client application exiting. Meanwhile this freed up memory was already
> allocated for a subsequent client applications job structure again.
> Hence, the new jobs memory got corrupted. Luckily, I was able to
> reproduce the same corruption over and over again by just using
> deqp-runner to run a specific set of VK test cases in parallel.
>
> Fixing this issue doesn't seem to be very straightforward though (unless
> I miss something), which is why I'm writing this mail instead of sending
> a fix directly.
>
> Spontaneously, I see three options to fix it:
>
> 1. Rather than embedding the entity into driver specific structures
> (e.g. tied to file_priv) we could allocate the entity separately and
> reference count it, such that it's only freed up once all jobs that were
> deployed through this entity are fetched from the schedulers pending list.
>
My vote is on this or something in similar vain for the long term. I
have some hope to be able to add a GPU scheduling algorithm with a bit
more fairness than the current one sometime in the future, which
requires execution time tracking on the entities.

> 2. Somehow make sure drm_sched_entity_destroy() does block until all
> jobs deployed through this entity were fetched from the schedulers
> pending list. Though, I'm pretty sure that this is not really desirable.
>
> 3. Just revert the change and let drivers implement tracking of GPU
> active times themselves.
>
Given that we are already pretty late in the release cycle and etnaviv
being the only driver so far making use of the scheduler elapsed time
tracking I think the right short term solution is to either move the
tracking into etnaviv or just revert the change for now. I'll have a
look at this.

Regards,
Lucas

> In the case of just reverting the change I'd propose to also set a jobs
> entity pointer to NULL once the job was taken from the entity, such
> that in case of a future issue we fail where the actual issue resides
> and to make it more obvious that the field shouldn't be used anymore
> after the job was taken from the entity.
>
> I'm happy to implement the solution we agree on. However, it might also
> make sense to revert the change until we have a solution in place. I'm
> also happy to send a revert with a proper description of the problem.
> Please let me know what you think.
>
> - Danilo
>

2023-03-28 10:35:23

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

The same strategy was suggested before for amdgpu and reverted as well
because of running into the same problems.

No idea how that slipped, Andrew reviewed it and IIRC he was also the
one who reverted the initial approach for amdgpu.

Instead of letting the scheduler sum that stuff up proactively we should
take care of that re-actively when the fdinfo data is queried like
amdgpu does as well.

We should probably revert that patch for now and then discuss how we can
do this in a generalized helper.

Regards,
Christian.

Am 28.03.23 um 02:54 schrieb Danilo Krummrich:
> Hi all,
>
> Commit df622729ddbf ("drm/scheduler: track GPU active time per
> entity") tries to track the accumulated time that a job was active on
> the GPU writing it to the entity through which the job was deployed to
> the scheduler originally. This is done within
> drm_sched_get_cleanup_job() which fetches a job from the schedulers
> pending_list.
>
> Doing this can result in a race condition where the entity is already
> freed, but the entity's newly added elapsed_ns field is still accessed
> once the job is fetched from the pending_list.
>
> After drm_sched_entity_destroy() being called it should be safe to
> free the structure that embeds the entity. However, a job originally
> handed over to the scheduler by this entity might still reside in the
> schedulers pending_list for cleanup after drm_sched_entity_destroy()
> already being called and the entity being freed. Hence, we can run
> into a UAF.
>
> In my case it happened that a job, as explained above, was just picked
> from the schedulers pending_list after the entity was freed due to the
> client application exiting. Meanwhile this freed up memory was already
> allocated for a subsequent client applications job structure again.
> Hence, the new jobs memory got corrupted. Luckily, I was able to
> reproduce the same corruption over and over again by just using
> deqp-runner to run a specific set of VK test cases in parallel.
>
> Fixing this issue doesn't seem to be very straightforward though
> (unless I miss something), which is why I'm writing this mail instead
> of sending a fix directly.
>
> Spontaneously, I see three options to fix it:
>
> 1. Rather than embedding the entity into driver specific structures
> (e.g. tied to file_priv) we could allocate the entity separately and
> reference count it, such that it's only freed up once all jobs that
> were deployed through this entity are fetched from the schedulers
> pending list.
>
> 2. Somehow make sure drm_sched_entity_destroy() does block until all
> jobs deployed through this entity were fetched from the schedulers
> pending list. Though, I'm pretty sure that this is not really desirable.
>
> 3. Just revert the change and let drivers implement tracking of GPU
> active times themselves.
>
> In the case of just reverting the change I'd propose to also set a
> jobs entity pointer to NULL once the job was taken from the entity,
> such that in case of a future issue we fail where the actual issue
> resides and to make it more obvious that the field shouldn't be used
> anymore after the job was taken from the entity.
>
> I'm happy to implement the solution we agree on. However, it might
> also make sense to revert the change until we have a solution in
> place. I'm also happy to send a revert with a proper description of
> the problem. Please let me know what you think.
>
> - Danilo

2023-04-04 04:39:09

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 2023-03-28 04:54, Lucas Stach wrote:
> Hi Danilo,
>
> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>> Hi all,
>>
>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>> tries to track the accumulated time that a job was active on the GPU
>> writing it to the entity through which the job was deployed to the
>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>> which fetches a job from the schedulers pending_list.
>>
>> Doing this can result in a race condition where the entity is already
>> freed, but the entity's newly added elapsed_ns field is still accessed
>> once the job is fetched from the pending_list.
>>
>> After drm_sched_entity_destroy() being called it should be safe to free
>> the structure that embeds the entity. However, a job originally handed
>> over to the scheduler by this entity might still reside in the
>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>> already being called and the entity being freed. Hence, we can run into
>> a UAF.
>>
> Sorry about that, I clearly didn't properly consider this case.
>
>> In my case it happened that a job, as explained above, was just picked
>> from the schedulers pending_list after the entity was freed due to the
>> client application exiting. Meanwhile this freed up memory was already
>> allocated for a subsequent client applications job structure again.
>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>> reproduce the same corruption over and over again by just using
>> deqp-runner to run a specific set of VK test cases in parallel.
>>
>> Fixing this issue doesn't seem to be very straightforward though (unless
>> I miss something), which is why I'm writing this mail instead of sending
>> a fix directly.
>>
>> Spontaneously, I see three options to fix it:
>>
>> 1. Rather than embedding the entity into driver specific structures
>> (e.g. tied to file_priv) we could allocate the entity separately and
>> reference count it, such that it's only freed up once all jobs that were
>> deployed through this entity are fetched from the schedulers pending list.
>>
> My vote is on this or something in similar vain for the long term. I
> have some hope to be able to add a GPU scheduling algorithm with a bit
> more fairness than the current one sometime in the future, which
> requires execution time tracking on the entities.

Danilo,

Using kref is preferable, i.e. option 1 above.

Lucas, can you shed some light on,

1. In what way the current FIFO scheduling is unfair, and
2. shed some details on this "scheduling algorithm with a bit
more fairness than the current one"?

Regards,
Luben

>
>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>> jobs deployed through this entity were fetched from the schedulers
>> pending list. Though, I'm pretty sure that this is not really desirable.
>>
>> 3. Just revert the change and let drivers implement tracking of GPU
>> active times themselves.
>>
> Given that we are already pretty late in the release cycle and etnaviv
> being the only driver so far making use of the scheduler elapsed time
> tracking I think the right short term solution is to either move the
> tracking into etnaviv or just revert the change for now. I'll have a
> look at this.
>
> Regards,
> Lucas
>
>> In the case of just reverting the change I'd propose to also set a jobs
>> entity pointer to NULL once the job was taken from the entity, such
>> that in case of a future issue we fail where the actual issue resides
>> and to make it more obvious that the field shouldn't be used anymore
>> after the job was taken from the entity.
>>
>> I'm happy to implement the solution we agree on. However, it might also
>> make sense to revert the change until we have a solution in place. I'm
>> also happy to send a revert with a proper description of the problem.
>> Please let me know what you think.
>>
>> - Danilo
>>
>

2023-04-05 15:40:25

by Danilo Krummrich

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 4/4/23 06:31, Luben Tuikov wrote:
> On 2023-03-28 04:54, Lucas Stach wrote:
>> Hi Danilo,
>>
>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>> Hi all,
>>>
>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>> tries to track the accumulated time that a job was active on the GPU
>>> writing it to the entity through which the job was deployed to the
>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>> which fetches a job from the schedulers pending_list.
>>>
>>> Doing this can result in a race condition where the entity is already
>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>> once the job is fetched from the pending_list.
>>>
>>> After drm_sched_entity_destroy() being called it should be safe to free
>>> the structure that embeds the entity. However, a job originally handed
>>> over to the scheduler by this entity might still reside in the
>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>> already being called and the entity being freed. Hence, we can run into
>>> a UAF.
>>>
>> Sorry about that, I clearly didn't properly consider this case.
>>
>>> In my case it happened that a job, as explained above, was just picked
>>> from the schedulers pending_list after the entity was freed due to the
>>> client application exiting. Meanwhile this freed up memory was already
>>> allocated for a subsequent client applications job structure again.
>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>> reproduce the same corruption over and over again by just using
>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>
>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>> I miss something), which is why I'm writing this mail instead of sending
>>> a fix directly.
>>>
>>> Spontaneously, I see three options to fix it:
>>>
>>> 1. Rather than embedding the entity into driver specific structures
>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>> reference count it, such that it's only freed up once all jobs that were
>>> deployed through this entity are fetched from the schedulers pending list.
>>>
>> My vote is on this or something in similar vain for the long term. I
>> have some hope to be able to add a GPU scheduling algorithm with a bit
>> more fairness than the current one sometime in the future, which
>> requires execution time tracking on the entities.
>
> Danilo,
>
> Using kref is preferable, i.e. option 1 above.

I think the only real motivation for doing that would be for generically
tracking job statistics within the entity a job was deployed through. If
we all agree on tracking job statistics this way I am happy to prepare a
patch for this option and drop this one:
https://lore.kernel.org/all/[email protected]/T/#u

Christian mentioned amdgpu tried something similar to what Lucas tried
running into similar trouble, backed off and implemented it in another
way - a driver specific way I guess?

>
> Lucas, can you shed some light on,
>
> 1. In what way the current FIFO scheduling is unfair, and
> 2. shed some details on this "scheduling algorithm with a bit
> more fairness than the current one"?
>
> Regards,
> Luben
>
>>
>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>> jobs deployed through this entity were fetched from the schedulers
>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>
>>> 3. Just revert the change and let drivers implement tracking of GPU
>>> active times themselves.
>>>
>> Given that we are already pretty late in the release cycle and etnaviv
>> being the only driver so far making use of the scheduler elapsed time
>> tracking I think the right short term solution is to either move the
>> tracking into etnaviv or just revert the change for now. I'll have a
>> look at this.
>>
>> Regards,
>> Lucas
>>
>>> In the case of just reverting the change I'd propose to also set a jobs
>>> entity pointer to NULL once the job was taken from the entity, such
>>> that in case of a future issue we fail where the actual issue resides
>>> and to make it more obvious that the field shouldn't be used anymore
>>> after the job was taken from the entity.
>>>
>>> I'm happy to implement the solution we agree on. However, it might also
>>> make sense to revert the change until we have a solution in place. I'm
>>> also happy to send a revert with a proper description of the problem.
>>> Please let me know what you think.
>>>
>>> - Danilo
>>>
>>
>

2023-04-05 16:12:56

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 2023-04-05 10:05, Danilo Krummrich wrote:
> On 4/4/23 06:31, Luben Tuikov wrote:
>> On 2023-03-28 04:54, Lucas Stach wrote:
>>> Hi Danilo,
>>>
>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>> Hi all,
>>>>
>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>> tries to track the accumulated time that a job was active on the GPU
>>>> writing it to the entity through which the job was deployed to the
>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>> which fetches a job from the schedulers pending_list.
>>>>
>>>> Doing this can result in a race condition where the entity is already
>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>> once the job is fetched from the pending_list.
>>>>
>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>> the structure that embeds the entity. However, a job originally handed
>>>> over to the scheduler by this entity might still reside in the
>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>> already being called and the entity being freed. Hence, we can run into
>>>> a UAF.
>>>>
>>> Sorry about that, I clearly didn't properly consider this case.
>>>
>>>> In my case it happened that a job, as explained above, was just picked
>>>> from the schedulers pending_list after the entity was freed due to the
>>>> client application exiting. Meanwhile this freed up memory was already
>>>> allocated for a subsequent client applications job structure again.
>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>> reproduce the same corruption over and over again by just using
>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>
>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>> I miss something), which is why I'm writing this mail instead of sending
>>>> a fix directly.
>>>>
>>>> Spontaneously, I see three options to fix it:
>>>>
>>>> 1. Rather than embedding the entity into driver specific structures
>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>> reference count it, such that it's only freed up once all jobs that were
>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>
>>> My vote is on this or something in similar vain for the long term. I
>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>> more fairness than the current one sometime in the future, which
>>> requires execution time tracking on the entities.
>>
>> Danilo,
>>
>> Using kref is preferable, i.e. option 1 above.
>
> I think the only real motivation for doing that would be for generically
> tracking job statistics within the entity a job was deployed through. If
> we all agree on tracking job statistics this way I am happy to prepare a
> patch for this option and drop this one:
> https://lore.kernel.org/all/[email protected]/T/#u

Hmm, I never thought about "job statistics" when I preferred using kref above.
The reason kref is attractive is because one doesn't need to worry about
it--when the last user drops the kref, the release is called to do
housekeeping. If this never happens, we know that we have a bug to debug.

Regarding the patch above--I did look around the code, and it seems safe,
as per your analysis, I didn't see any reference to entity after job submission,
but I'll comment on that thread as well for the record.

Regards,
Luben

>
> Christian mentioned amdgpu tried something similar to what Lucas tried
> running into similar trouble, backed off and implemented it in another
> way - a driver specific way I guess?
>
>>
>> Lucas, can you shed some light on,
>>
>> 1. In what way the current FIFO scheduling is unfair, and
>> 2. shed some details on this "scheduling algorithm with a bit
>> more fairness than the current one"?
>>
>> Regards,
>> Luben
>>
>>>
>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>>> jobs deployed through this entity were fetched from the schedulers
>>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>>
>>>> 3. Just revert the change and let drivers implement tracking of GPU
>>>> active times themselves.
>>>>
>>> Given that we are already pretty late in the release cycle and etnaviv
>>> being the only driver so far making use of the scheduler elapsed time
>>> tracking I think the right short term solution is to either move the
>>> tracking into etnaviv or just revert the change for now. I'll have a
>>> look at this.
>>>
>>> Regards,
>>> Lucas
>>>
>>>> In the case of just reverting the change I'd propose to also set a jobs
>>>> entity pointer to NULL once the job was taken from the entity, such
>>>> that in case of a future issue we fail where the actual issue resides
>>>> and to make it more obvious that the field shouldn't be used anymore
>>>> after the job was taken from the entity.
>>>>
>>>> I'm happy to implement the solution we agree on. However, it might also
>>>> make sense to revert the change until we have a solution in place. I'm
>>>> also happy to send a revert with a proper description of the problem.
>>>> Please let me know what you think.
>>>>
>>>> - Danilo
>>>>
>>>
>>
>

2023-04-05 17:49:29

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Hi Luben,

Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
> On 2023-03-28 04:54, Lucas Stach wrote:
> > Hi Danilo,
> >
> > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > Hi all,
> > >
> > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > tries to track the accumulated time that a job was active on the GPU
> > > writing it to the entity through which the job was deployed to the
> > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > which fetches a job from the schedulers pending_list.
> > >
> > > Doing this can result in a race condition where the entity is already
> > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > once the job is fetched from the pending_list.
> > >
> > > After drm_sched_entity_destroy() being called it should be safe to free
> > > the structure that embeds the entity. However, a job originally handed
> > > over to the scheduler by this entity might still reside in the
> > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > already being called and the entity being freed. Hence, we can run into
> > > a UAF.
> > >
> > Sorry about that, I clearly didn't properly consider this case.
> >
> > > In my case it happened that a job, as explained above, was just picked
> > > from the schedulers pending_list after the entity was freed due to the
> > > client application exiting. Meanwhile this freed up memory was already
> > > allocated for a subsequent client applications job structure again.
> > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > reproduce the same corruption over and over again by just using
> > > deqp-runner to run a specific set of VK test cases in parallel.
> > >
> > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > I miss something), which is why I'm writing this mail instead of sending
> > > a fix directly.
> > >
> > > Spontaneously, I see three options to fix it:
> > >
> > > 1. Rather than embedding the entity into driver specific structures
> > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > reference count it, such that it's only freed up once all jobs that were
> > > deployed through this entity are fetched from the schedulers pending list.
> > >
> > My vote is on this or something in similar vain for the long term. I
> > have some hope to be able to add a GPU scheduling algorithm with a bit
> > more fairness than the current one sometime in the future, which
> > requires execution time tracking on the entities.
>
> Danilo,
>
> Using kref is preferable, i.e. option 1 above.
>
> Lucas, can you shed some light on,
>
> 1. In what way the current FIFO scheduling is unfair, and
> 2. shed some details on this "scheduling algorithm with a bit
> more fairness than the current one"?

I don't have a specific implementation in mind yet. However the current
FIFO algorithm can be very unfair if you have a sparse workload compete
with one that generates a lot of jobs without any throttling aside from
the entity queue length. By tracking the actual GPU time consumed by
the entities we could implement something with a bit more fairness like
deficit round robin (don't pin me on the specific algorithm, as I
haven't given it much thought yet).

Regards,
Lucas

2023-04-05 17:52:07

by Danilo Krummrich

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 4/5/23 18:09, Luben Tuikov wrote:
> On 2023-04-05 10:05, Danilo Krummrich wrote:
>> On 4/4/23 06:31, Luben Tuikov wrote:
>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>> Hi Danilo,
>>>>
>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>> Hi all,
>>>>>
>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>> writing it to the entity through which the job was deployed to the
>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>> which fetches a job from the schedulers pending_list.
>>>>>
>>>>> Doing this can result in a race condition where the entity is already
>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>> once the job is fetched from the pending_list.
>>>>>
>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>> the structure that embeds the entity. However, a job originally handed
>>>>> over to the scheduler by this entity might still reside in the
>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>> already being called and the entity being freed. Hence, we can run into
>>>>> a UAF.
>>>>>
>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>
>>>>> In my case it happened that a job, as explained above, was just picked
>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>> allocated for a subsequent client applications job structure again.
>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>> reproduce the same corruption over and over again by just using
>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>
>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>> a fix directly.
>>>>>
>>>>> Spontaneously, I see three options to fix it:
>>>>>
>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>
>>>> My vote is on this or something in similar vain for the long term. I
>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>> more fairness than the current one sometime in the future, which
>>>> requires execution time tracking on the entities.
>>>
>>> Danilo,
>>>
>>> Using kref is preferable, i.e. option 1 above.
>>
>> I think the only real motivation for doing that would be for generically
>> tracking job statistics within the entity a job was deployed through. If
>> we all agree on tracking job statistics this way I am happy to prepare a
>> patch for this option and drop this one:
>> https://lore.kernel.org/all/[email protected]/T/#u
>
> Hmm, I never thought about "job statistics" when I preferred using kref above.
> The reason kref is attractive is because one doesn't need to worry about
> it--when the last user drops the kref, the release is called to do
> housekeeping. If this never happens, we know that we have a bug to debug.
>

I tied it to the use case of (accumulated) job statistics since I think
using kref might only make sense if we have a reason why a job needs to
access the entity it was deployed through. And the only real reason to
keep this connection seems to be (accumulated) job statistics.

If it is only about allowing drivers to derive a driver specific entity
structure from a job I think it might depend on whether this is a
"typical application" nearly every driver does (which it seems not to
be) or if it is an exception.

In the latter case the driver could still store the corresponding
pointers in its driver specific structures and take care of not freeing
the structure the entity is embedded in until it is safe to do so.

Since it was already tried twice to generically get (accumulated) job
statistics into entities personally I think doing so plus using kref
would be quite nice. However, I'd like to be sure this fits for
everyone's use cases.

> Regarding the patch above--I did look around the code, and it seems safe,
> as per your analysis, I didn't see any reference to entity after job submission,
> but I'll comment on that thread as well for the record.
>
> Regards,
> Luben
>
>>
>> Christian mentioned amdgpu tried something similar to what Lucas tried
>> running into similar trouble, backed off and implemented it in another
>> way - a driver specific way I guess?
>>
>>>
>>> Lucas, can you shed some light on,
>>>
>>> 1. In what way the current FIFO scheduling is unfair, and
>>> 2. shed some details on this "scheduling algorithm with a bit
>>> more fairness than the current one"?
>>>
>>> Regards,
>>> Luben
>>>
>>>>
>>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>>>> jobs deployed through this entity were fetched from the schedulers
>>>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>>>
>>>>> 3. Just revert the change and let drivers implement tracking of GPU
>>>>> active times themselves.
>>>>>
>>>> Given that we are already pretty late in the release cycle and etnaviv
>>>> being the only driver so far making use of the scheduler elapsed time
>>>> tracking I think the right short term solution is to either move the
>>>> tracking into etnaviv or just revert the change for now. I'll have a
>>>> look at this.
>>>>
>>>> Regards,
>>>> Lucas
>>>>
>>>>> In the case of just reverting the change I'd propose to also set a jobs
>>>>> entity pointer to NULL once the job was taken from the entity, such
>>>>> that in case of a future issue we fail where the actual issue resides
>>>>> and to make it more obvious that the field shouldn't be used anymore
>>>>> after the job was taken from the entity.
>>>>>
>>>>> I'm happy to implement the solution we agree on. However, it might also
>>>>> make sense to revert the change until we have a solution in place. I'm
>>>>> also happy to send a revert with a proper description of the problem.
>>>>> Please let me know what you think.
>>>>>
>>>>> - Danilo
>>>>>
>>>>
>>>
>>
>

2023-04-05 20:46:32

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 2023-04-05 13:44, Lucas Stach wrote:
> Hi Luben,
>
> Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
>> On 2023-03-28 04:54, Lucas Stach wrote:
>>> Hi Danilo,
>>>
>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>> Hi all,
>>>>
>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>> tries to track the accumulated time that a job was active on the GPU
>>>> writing it to the entity through which the job was deployed to the
>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>> which fetches a job from the schedulers pending_list.
>>>>
>>>> Doing this can result in a race condition where the entity is already
>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>> once the job is fetched from the pending_list.
>>>>
>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>> the structure that embeds the entity. However, a job originally handed
>>>> over to the scheduler by this entity might still reside in the
>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>> already being called and the entity being freed. Hence, we can run into
>>>> a UAF.
>>>>
>>> Sorry about that, I clearly didn't properly consider this case.
>>>
>>>> In my case it happened that a job, as explained above, was just picked
>>>> from the schedulers pending_list after the entity was freed due to the
>>>> client application exiting. Meanwhile this freed up memory was already
>>>> allocated for a subsequent client applications job structure again.
>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>> reproduce the same corruption over and over again by just using
>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>
>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>> I miss something), which is why I'm writing this mail instead of sending
>>>> a fix directly.
>>>>
>>>> Spontaneously, I see three options to fix it:
>>>>
>>>> 1. Rather than embedding the entity into driver specific structures
>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>> reference count it, such that it's only freed up once all jobs that were
>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>
>>> My vote is on this or something in similar vain for the long term. I
>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>> more fairness than the current one sometime in the future, which
>>> requires execution time tracking on the entities.
>>
>> Danilo,
>>
>> Using kref is preferable, i.e. option 1 above.
>>
>> Lucas, can you shed some light on,
>>
>> 1. In what way the current FIFO scheduling is unfair, and
>> 2. shed some details on this "scheduling algorithm with a bit
>> more fairness than the current one"?
>
> I don't have a specific implementation in mind yet. However the current
> FIFO algorithm can be very unfair if you have a sparse workload compete
> with one that generates a lot of jobs without any throttling aside from
> the entity queue length.

Ah, that's interesting, let's see, a "sparse workload compete with one that
generates a lot of jobs", so basically we have a sparse workload compete
with a dense workload. So we can represent this with two entities, A, B,
whose jobs we're going to represent by the entities, names A and B.
So if we let A be the sparse workload and B the dense workload,
we have this, wlog,

First/oldest job, .........................., latter/new jobs.
Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, .....

The current FIFO algorithm, would prefer to execute those jobs
in order of submission, i.e. oldest-ready-first job. Assume
that all jobs are ready. Then we'll execute them in order.
This is desirable and fair. We want to execute the jobs
in the order they were submitted, given also that they are
ready to be executed. So perhaps we want to execute them like this:

First/oldest job, .........................., latter/new jobs.
Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, ....
Exec: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...

Any other ordering would starve either A, or B. If we executed the 2nd A
job at t6 or t7, then that would starve the 3rd/4th job in B, since the 2nd A job
arrives at the same time as that of the 3rd B job, at time t6.
The time t3-t0 is some delta > 0, some initial scheduler-start up time.

IOW, we don't want to delay a job any more than it should wait--the oldest
job, which is also ready, should execute next, so that we're fair how
it executes in real time. We cannot boot B at t6, so that we execute A,
just because it is sparse, but just arrived.

From A's point of view, it shouldn't expect its job execution time distribution
to be any different than its submission time distribution.

Do you think there's a job permutation which offers a fairer scheduling
than the Exec line above for the Submission line above?

> By tracking the actual GPU time consumed by
> the entities we could implement something with a bit more fairness like
> deficit round robin (don't pin me on the specific algorithm, as I
> haven't given it much thought yet).

Since there's no preemption, this would be hard to achieve--you're at the mercy
of the execution time of job A_i for an entity A job i. (Assuming there's no
preemption as it is the current state of the GPU scheduler.)

The only thing you can do, is punish the next job from this entity, A_i+1,
to execute much later. However, you don't know how long A_i+1 would take. If A_i+1
takes very little time, then you're better off executing it at the next opportune
time, i.e. when it would normally execute. But such an algorithm, which doesn't
know a priori the execution time of a job, would punish A_i+1 to execute much later.

But if A_i+1 takes time as long or longer than A_i, then punishing it to execute much
later, would simply delay it, from an observer's point a view, it wouldn't teach
the context to submit smaller jobs, so that GPU sharing is more fair.

(Note that if we know the job's time a priori, we could do something like bin packing
to accommodate fair scheduling over the long run.)

One way to partially remedy the situation is parallelism. The more parallel execution
units (schedulers) you have, the more the alleviation. We'd get in trouble iff we get
all jobs executing in all schedulers, each taking a long time, with probability going
to 0 as you increase the parallel execution units (not considering a pathological
execution time distribution, where all jobs take a long time universally.)

Thinking about the FIFO discussion above: even if one job in a sparse-with-dense load
distributions from a number of context, takes a long long time to execute, the very next
job you'd want to execute is the oldest (the one who's been waiting to most) ready
job--why delay it any further--that'll starve it.
--
Regards,
Luben

2023-04-06 08:26:51

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> On 2023-04-05 10:05, Danilo Krummrich wrote:
>> On 4/4/23 06:31, Luben Tuikov wrote:
>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>> Hi Danilo,
>>>>
>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>> Hi all,
>>>>>
>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>> writing it to the entity through which the job was deployed to the
>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>> which fetches a job from the schedulers pending_list.
>>>>>
>>>>> Doing this can result in a race condition where the entity is already
>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>> once the job is fetched from the pending_list.
>>>>>
>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>> the structure that embeds the entity. However, a job originally handed
>>>>> over to the scheduler by this entity might still reside in the
>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>> already being called and the entity being freed. Hence, we can run into
>>>>> a UAF.
>>>>>
>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>
>>>>> In my case it happened that a job, as explained above, was just picked
>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>> allocated for a subsequent client applications job structure again.
>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>> reproduce the same corruption over and over again by just using
>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>
>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>> a fix directly.
>>>>>
>>>>> Spontaneously, I see three options to fix it:
>>>>>
>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>
>>>> My vote is on this or something in similar vain for the long term. I
>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>> more fairness than the current one sometime in the future, which
>>>> requires execution time tracking on the entities.
>>> Danilo,
>>>
>>> Using kref is preferable, i.e. option 1 above.
>> I think the only real motivation for doing that would be for generically
>> tracking job statistics within the entity a job was deployed through. If
>> we all agree on tracking job statistics this way I am happy to prepare a
>> patch for this option and drop this one:
>> https://lore.kernel.org/all/[email protected]/T/#u
> Hmm, I never thought about "job statistics" when I preferred using kref above.
> The reason kref is attractive is because one doesn't need to worry about
> it--when the last user drops the kref, the release is called to do
> housekeeping. If this never happens, we know that we have a bug to debug.

Yeah, reference counting unfortunately have some traps as well. For
example rarely dropping the last reference from interrupt context or
with some unexpected locks help when the cleanup function doesn't expect
that is a good recipe for problems as well.

> Regarding the patch above--I did look around the code, and it seems safe,
> as per your analysis, I didn't see any reference to entity after job submission,
> but I'll comment on that thread as well for the record.

Reference counting the entities was suggested before. The intentionally
avoided that so far because the entity might be the tip of the iceberg
of stuff you need to keep around.

For example for command submission you also need the VM and when you
keep the VM alive you also need to keep the file private alive....

Additional to that we have some ugly inter dependencies between tearing
down an application (potential with a KILL signal from the OOM killer)
and backward compatibility for some applications which render something
and quit before the rendering is completed in the hardware.

Regards,
Christian.

>
> Regards,
> Luben
>
>> Christian mentioned amdgpu tried something similar to what Lucas tried
>> running into similar trouble, backed off and implemented it in another
>> way - a driver specific way I guess?
>>
>>> Lucas, can you shed some light on,
>>>
>>> 1. In what way the current FIFO scheduling is unfair, and
>>> 2. shed some details on this "scheduling algorithm with a bit
>>> more fairness than the current one"?
>>>
>>> Regards,
>>> Luben
>>>
>>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>>>> jobs deployed through this entity were fetched from the schedulers
>>>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>>>
>>>>> 3. Just revert the change and let drivers implement tracking of GPU
>>>>> active times themselves.
>>>>>
>>>> Given that we are already pretty late in the release cycle and etnaviv
>>>> being the only driver so far making use of the scheduler elapsed time
>>>> tracking I think the right short term solution is to either move the
>>>> tracking into etnaviv or just revert the change for now. I'll have a
>>>> look at this.
>>>>
>>>> Regards,
>>>> Lucas
>>>>
>>>>> In the case of just reverting the change I'd propose to also set a jobs
>>>>> entity pointer to NULL once the job was taken from the entity, such
>>>>> that in case of a future issue we fail where the actual issue resides
>>>>> and to make it more obvious that the field shouldn't be used anymore
>>>>> after the job was taken from the entity.
>>>>>
>>>>> I'm happy to implement the solution we agree on. However, it might also
>>>>> make sense to revert the change until we have a solution in place. I'm
>>>>> also happy to send a revert with a proper description of the problem.
>>>>> Please let me know what you think.
>>>>>
>>>>> - Danilo
>>>>>

2023-04-06 08:28:27

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>
> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > On 2023-04-05 10:05, Danilo Krummrich wrote:
> >> On 4/4/23 06:31, Luben Tuikov wrote:
> >>> On 2023-03-28 04:54, Lucas Stach wrote:
> >>>> Hi Danilo,
> >>>>
> >>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> >>>>> Hi all,
> >>>>>
> >>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> >>>>> tries to track the accumulated time that a job was active on the GPU
> >>>>> writing it to the entity through which the job was deployed to the
> >>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
> >>>>> which fetches a job from the schedulers pending_list.
> >>>>>
> >>>>> Doing this can result in a race condition where the entity is already
> >>>>> freed, but the entity's newly added elapsed_ns field is still accessed
> >>>>> once the job is fetched from the pending_list.
> >>>>>
> >>>>> After drm_sched_entity_destroy() being called it should be safe to free
> >>>>> the structure that embeds the entity. However, a job originally handed
> >>>>> over to the scheduler by this entity might still reside in the
> >>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
> >>>>> already being called and the entity being freed. Hence, we can run into
> >>>>> a UAF.
> >>>>>
> >>>> Sorry about that, I clearly didn't properly consider this case.
> >>>>
> >>>>> In my case it happened that a job, as explained above, was just picked
> >>>>> from the schedulers pending_list after the entity was freed due to the
> >>>>> client application exiting. Meanwhile this freed up memory was already
> >>>>> allocated for a subsequent client applications job structure again.
> >>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
> >>>>> reproduce the same corruption over and over again by just using
> >>>>> deqp-runner to run a specific set of VK test cases in parallel.
> >>>>>
> >>>>> Fixing this issue doesn't seem to be very straightforward though (unless
> >>>>> I miss something), which is why I'm writing this mail instead of sending
> >>>>> a fix directly.
> >>>>>
> >>>>> Spontaneously, I see three options to fix it:
> >>>>>
> >>>>> 1. Rather than embedding the entity into driver specific structures
> >>>>> (e.g. tied to file_priv) we could allocate the entity separately and
> >>>>> reference count it, such that it's only freed up once all jobs that were
> >>>>> deployed through this entity are fetched from the schedulers pending list.
> >>>>>
> >>>> My vote is on this or something in similar vain for the long term. I
> >>>> have some hope to be able to add a GPU scheduling algorithm with a bit
> >>>> more fairness than the current one sometime in the future, which
> >>>> requires execution time tracking on the entities.
> >>> Danilo,
> >>>
> >>> Using kref is preferable, i.e. option 1 above.
> >> I think the only real motivation for doing that would be for generically
> >> tracking job statistics within the entity a job was deployed through. If
> >> we all agree on tracking job statistics this way I am happy to prepare a
> >> patch for this option and drop this one:
> >> https://lore.kernel.org/all/[email protected]/T/#u
> > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > The reason kref is attractive is because one doesn't need to worry about
> > it--when the last user drops the kref, the release is called to do
> > housekeeping. If this never happens, we know that we have a bug to debug.
>
> Yeah, reference counting unfortunately have some traps as well. For
> example rarely dropping the last reference from interrupt context or
> with some unexpected locks help when the cleanup function doesn't expect
> that is a good recipe for problems as well.
>
> > Regarding the patch above--I did look around the code, and it seems safe,
> > as per your analysis, I didn't see any reference to entity after job submission,
> > but I'll comment on that thread as well for the record.
>
> Reference counting the entities was suggested before. The intentionally
> avoided that so far because the entity might be the tip of the iceberg
> of stuff you need to keep around.
>
> For example for command submission you also need the VM and when you
> keep the VM alive you also need to keep the file private alive....

Yeah refcounting looks often like the easy way out to avoid
use-after-free issue, until you realize you've just made lifetimes
unbounded and have some enourmous leaks: entity keeps vm alive, vm
keeps all the bo alives, somehow every crash wastes more memory
because vk_device_lost means userspace allocates new stuff for
everything.

If possible a lifetime design where lifetimes have hard bounds and you
just borrow a reference under a lock (or some other ownership rule) is
generally much more solid. But also much harder to design correctly
:-/

> Additional to that we have some ugly inter dependencies between tearing
> down an application (potential with a KILL signal from the OOM killer)
> and backward compatibility for some applications which render something
> and quit before the rendering is completed in the hardware.

Yeah I think that part would also be good to sort out once&for all in
drm/sched, because i915 has/had the same struggle.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Regards,
> > Luben
> >
> >> Christian mentioned amdgpu tried something similar to what Lucas tried
> >> running into similar trouble, backed off and implemented it in another
> >> way - a driver specific way I guess?
> >>
> >>> Lucas, can you shed some light on,
> >>>
> >>> 1. In what way the current FIFO scheduling is unfair, and
> >>> 2. shed some details on this "scheduling algorithm with a bit
> >>> more fairness than the current one"?
> >>>
> >>> Regards,
> >>> Luben
> >>>
> >>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
> >>>>> jobs deployed through this entity were fetched from the schedulers
> >>>>> pending list. Though, I'm pretty sure that this is not really desirable.
> >>>>>
> >>>>> 3. Just revert the change and let drivers implement tracking of GPU
> >>>>> active times themselves.
> >>>>>
> >>>> Given that we are already pretty late in the release cycle and etnaviv
> >>>> being the only driver so far making use of the scheduler elapsed time
> >>>> tracking I think the right short term solution is to either move the
> >>>> tracking into etnaviv or just revert the change for now. I'll have a
> >>>> look at this.
> >>>>
> >>>> Regards,
> >>>> Lucas
> >>>>
> >>>>> In the case of just reverting the change I'd propose to also set a jobs
> >>>>> entity pointer to NULL once the job was taken from the entity, such
> >>>>> that in case of a future issue we fail where the actual issue resides
> >>>>> and to make it more obvious that the field shouldn't be used anymore
> >>>>> after the job was taken from the entity.
> >>>>>
> >>>>> I'm happy to implement the solution we agree on. However, it might also
> >>>>> make sense to revert the change until we have a solution in place. I'm
> >>>>> also happy to send a revert with a proper description of the problem.
> >>>>> Please let me know what you think.
> >>>>>
> >>>>> - Danilo
> >>>>>
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 09:12:08

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 06/04/2023 17.27, Daniel Vetter wrote:
> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>
>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>> Hi Danilo,
>>>>>>
>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>
>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>> once the job is fetched from the pending_list.
>>>>>>>
>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>> a UAF.
>>>>>>>
>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>
>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>
>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>> a fix directly.
>>>>>>>
>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>
>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>
>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>> more fairness than the current one sometime in the future, which
>>>>>> requires execution time tracking on the entities.
>>>>> Danilo,
>>>>>
>>>>> Using kref is preferable, i.e. option 1 above.
>>>> I think the only real motivation for doing that would be for generically
>>>> tracking job statistics within the entity a job was deployed through. If
>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>> patch for this option and drop this one:
>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>> The reason kref is attractive is because one doesn't need to worry about
>>> it--when the last user drops the kref, the release is called to do
>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>
>> Yeah, reference counting unfortunately have some traps as well. For
>> example rarely dropping the last reference from interrupt context or
>> with some unexpected locks help when the cleanup function doesn't expect
>> that is a good recipe for problems as well.
>>
>>> Regarding the patch above--I did look around the code, and it seems safe,
>>> as per your analysis, I didn't see any reference to entity after job submission,
>>> but I'll comment on that thread as well for the record.
>>
>> Reference counting the entities was suggested before. The intentionally
>> avoided that so far because the entity might be the tip of the iceberg
>> of stuff you need to keep around.
>>
>> For example for command submission you also need the VM and when you
>> keep the VM alive you also need to keep the file private alive....
>
> Yeah refcounting looks often like the easy way out to avoid
> use-after-free issue, until you realize you've just made lifetimes
> unbounded and have some enourmous leaks: entity keeps vm alive, vm
> keeps all the bo alives, somehow every crash wastes more memory
> because vk_device_lost means userspace allocates new stuff for
> everything.

Refcounting everywhere has been working well for us, so well that so far
all the oopses we've hit have been... drm_sched bugs like this one, not
anything in the driver. But at least in Rust you have the advantage that
you can't just forget a decref in a rarely-hit error path (or worse,
forget an incref somewhere important)... ^^

> If possible a lifetime design where lifetimes have hard bounds and you
> just borrow a reference under a lock (or some other ownership rule) is
> generally much more solid. But also much harder to design correctly
> :-/
>
>> Additional to that we have some ugly inter dependencies between tearing
>> down an application (potential with a KILL signal from the OOM killer)
>> and backward compatibility for some applications which render something
>> and quit before the rendering is completed in the hardware.
>
> Yeah I think that part would also be good to sort out once&for all in
> drm/sched, because i915 has/had the same struggle.
> -Daniel
>

Is this really a thing? I think that's never going to work well for
explicit sync, since the kernel doesn't even know what BOs it has to
keep alive for a job... I guess it could keep the entire file and all of
its objects/VMs/etc alive until all of its submissions complete but... ewww.

Our Mesa implementation synchronously waits for all jobs on context
destroy for this reason, but if you just kill the app, yeah, you get
faults as running GPU jobs have BOs yanked out from under them. Kill
loops make for a good way of testing fault handling...

~~ Lina

2023-04-06 10:25:34

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
> On 06/04/2023 17.27, Daniel Vetter wrote:
> > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > >
> > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > Hi Danilo,
> > > > > > >
> > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > >
> > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > once the job is fetched from the pending_list.
> > > > > > > >
> > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > a UAF.
> > > > > > > >
> > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > >
> > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > >
> > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > a fix directly.
> > > > > > > >
> > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > >
> > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > >
> > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > requires execution time tracking on the entities.
> > > > > > Danilo,
> > > > > >
> > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > I think the only real motivation for doing that would be for generically
> > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > patch for this option and drop this one:
> > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > The reason kref is attractive is because one doesn't need to worry about
> > > > it--when the last user drops the kref, the release is called to do
> > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > >
> > > Yeah, reference counting unfortunately have some traps as well. For
> > > example rarely dropping the last reference from interrupt context or
> > > with some unexpected locks help when the cleanup function doesn't expect
> > > that is a good recipe for problems as well.
> > >
> > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > but I'll comment on that thread as well for the record.
> > >
> > > Reference counting the entities was suggested before. The intentionally
> > > avoided that so far because the entity might be the tip of the iceberg
> > > of stuff you need to keep around.
> > >
> > > For example for command submission you also need the VM and when you
> > > keep the VM alive you also need to keep the file private alive....
> >
> > Yeah refcounting looks often like the easy way out to avoid
> > use-after-free issue, until you realize you've just made lifetimes
> > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > keeps all the bo alives, somehow every crash wastes more memory
> > because vk_device_lost means userspace allocates new stuff for
> > everything.
>
> Refcounting everywhere has been working well for us, so well that so far all
> the oopses we've hit have been... drm_sched bugs like this one, not anything
> in the driver. But at least in Rust you have the advantage that you can't
> just forget a decref in a rarely-hit error path (or worse, forget an incref
> somewhere important)... ^^
>
> > If possible a lifetime design where lifetimes have hard bounds and you
> > just borrow a reference under a lock (or some other ownership rule) is
> > generally much more solid. But also much harder to design correctly
> > :-/
> >
> > > Additional to that we have some ugly inter dependencies between tearing
> > > down an application (potential with a KILL signal from the OOM killer)
> > > and backward compatibility for some applications which render something
> > > and quit before the rendering is completed in the hardware.
> >
> > Yeah I think that part would also be good to sort out once&for all in
> > drm/sched, because i915 has/had the same struggle.
> > -Daniel
> >
>
> Is this really a thing? I think that's never going to work well for explicit
> sync, since the kernel doesn't even know what BOs it has to keep alive for a
> job... I guess it could keep the entire file and all of its objects/VMs/etc
> alive until all of its submissions complete but... ewww.
>
> Our Mesa implementation synchronously waits for all jobs on context destroy
> for this reason, but if you just kill the app, yeah, you get faults as
> running GPU jobs have BOs yanked out from under them. Kill loops make for a
> good way of testing fault handling...

You wind down the entire thing on file close? Like
- stop all context
- tear down all context
- tear down all vm
- tear down all obj

Just winding things down in a random order and then letting gpu fault
handling sort out the mess doesn't strike me as particularly clean design
...

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 11:09:12

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
> >
> > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > Hi Danilo,
> > > > > >
> > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > >
> > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > once the job is fetched from the pending_list.
> > > > > > >
> > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > a UAF.
> > > > > > >
> > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > >
> > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > >
> > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > a fix directly.
> > > > > > >
> > > > > > > Spontaneously, I see three options to fix it:
> > > > > > >
> > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > >
> > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > more fairness than the current one sometime in the future, which
> > > > > > requires execution time tracking on the entities.
> > > > > Danilo,
> > > > >
> > > > > Using kref is preferable, i.e. option 1 above.
> > > > I think the only real motivation for doing that would be for generically
> > > > tracking job statistics within the entity a job was deployed through. If
> > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > patch for this option and drop this one:
> > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > The reason kref is attractive is because one doesn't need to worry about
> > > it--when the last user drops the kref, the release is called to do
> > > housekeeping. If this never happens, we know that we have a bug to debug.
> >
> > Yeah, reference counting unfortunately have some traps as well. For
> > example rarely dropping the last reference from interrupt context or
> > with some unexpected locks help when the cleanup function doesn't expect
> > that is a good recipe for problems as well.
> >
Fully agreed.

> > > Regarding the patch above--I did look around the code, and it seems safe,
> > > as per your analysis, I didn't see any reference to entity after job submission,
> > > but I'll comment on that thread as well for the record.
> >
> > Reference counting the entities was suggested before. The intentionally
> > avoided that so far because the entity might be the tip of the iceberg
> > of stuff you need to keep around.
> >
> > For example for command submission you also need the VM and when you
> > keep the VM alive you also need to keep the file private alive....
>
> Yeah refcounting looks often like the easy way out to avoid
> use-after-free issue, until you realize you've just made lifetimes
> unbounded and have some enourmous leaks: entity keeps vm alive, vm
> keeps all the bo alives, somehow every crash wastes more memory
> because vk_device_lost means userspace allocates new stuff for
> everything.
>
> If possible a lifetime design where lifetimes have hard bounds and you
> just borrow a reference under a lock (or some other ownership rule) is
> generally much more solid. But also much harder to design correctly
> :-/
>
The use we are discussing here is to keep the entity alive as long as
jobs from that entity are still active on the HW. While there are no
hard bounds on when a job will get inactive, at least it's not
unbounded. On a crash/fault the job will be removed from the hardware
pretty soon.

Well behaved jobs after a application shutdown might take a little
longer, but I don't really see the new problem with keeping the entity
alive? As long as a job is active on the hardware, we can't throw out
the VM or BOs, no difference whether the entity is kept alive or not.

Some hardware might have ways to expedite job inactivation by
deactivating scheduling queues, or just taking a fault, but for some HW
we'll just have to wait for the job to finish.

Regards,
Lucas

> > Additional to that we have some ugly inter dependencies between tearing
> > down an application (potential with a KILL signal from the OOM killer)
> > and backward compatibility for some applications which render something
> > and quit before the rendering is completed in the hardware.
>
> Yeah I think that part would also be good to sort out once&for all in
> drm/sched, because i915 has/had the same struggle.
> -Daniel
>
> >
> > Regards,
> > Christian.
> >
> > >
> > > Regards,
> > > Luben
> > >
> > > > Christian mentioned amdgpu tried something similar to what Lucas tried
> > > > running into similar trouble, backed off and implemented it in another
> > > > way - a driver specific way I guess?
> > > >
> > > > > Lucas, can you shed some light on,
> > > > >
> > > > > 1. In what way the current FIFO scheduling is unfair, and
> > > > > 2. shed some details on this "scheduling algorithm with a bit
> > > > > more fairness than the current one"?
> > > > >
> > > > > Regards,
> > > > > Luben
> > > > >
> > > > > > > 2. Somehow make sure drm_sched_entity_destroy() does block until all
> > > > > > > jobs deployed through this entity were fetched from the schedulers
> > > > > > > pending list. Though, I'm pretty sure that this is not really desirable.
> > > > > > >
> > > > > > > 3. Just revert the change and let drivers implement tracking of GPU
> > > > > > > active times themselves.
> > > > > > >
> > > > > > Given that we are already pretty late in the release cycle and etnaviv
> > > > > > being the only driver so far making use of the scheduler elapsed time
> > > > > > tracking I think the right short term solution is to either move the
> > > > > > tracking into etnaviv or just revert the change for now. I'll have a
> > > > > > look at this.
> > > > > >
> > > > > > Regards,
> > > > > > Lucas
> > > > > >
> > > > > > > In the case of just reverting the change I'd propose to also set a jobs
> > > > > > > entity pointer to NULL once the job was taken from the entity, such
> > > > > > > that in case of a future issue we fail where the actual issue resides
> > > > > > > and to make it more obvious that the field shouldn't be used anymore
> > > > > > > after the job was taken from the entity.
> > > > > > >
> > > > > > > I'm happy to implement the solution we agree on. However, it might also
> > > > > > > make sense to revert the change until we have a solution in place. I'm
> > > > > > > also happy to send a revert with a proper description of the problem.
> > > > > > > Please let me know what you think.
> > > > > > >
> > > > > > > - Danilo
> > > > > > >
> >
>
>

2023-04-06 12:10:17

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 12:45:12PM +0200, Lucas Stach wrote:
> Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > >
> > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > Hi Danilo,
> > > > > > >
> > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > >
> > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > once the job is fetched from the pending_list.
> > > > > > > >
> > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > a UAF.
> > > > > > > >
> > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > >
> > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > >
> > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > a fix directly.
> > > > > > > >
> > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > >
> > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > >
> > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > requires execution time tracking on the entities.
> > > > > > Danilo,
> > > > > >
> > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > I think the only real motivation for doing that would be for generically
> > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > patch for this option and drop this one:
> > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > The reason kref is attractive is because one doesn't need to worry about
> > > > it--when the last user drops the kref, the release is called to do
> > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > >
> > > Yeah, reference counting unfortunately have some traps as well. For
> > > example rarely dropping the last reference from interrupt context or
> > > with some unexpected locks help when the cleanup function doesn't expect
> > > that is a good recipe for problems as well.
> > >
> Fully agreed.
>
> > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > but I'll comment on that thread as well for the record.
> > >
> > > Reference counting the entities was suggested before. The intentionally
> > > avoided that so far because the entity might be the tip of the iceberg
> > > of stuff you need to keep around.
> > >
> > > For example for command submission you also need the VM and when you
> > > keep the VM alive you also need to keep the file private alive....
> >
> > Yeah refcounting looks often like the easy way out to avoid
> > use-after-free issue, until you realize you've just made lifetimes
> > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > keeps all the bo alives, somehow every crash wastes more memory
> > because vk_device_lost means userspace allocates new stuff for
> > everything.
> >
> > If possible a lifetime design where lifetimes have hard bounds and you
> > just borrow a reference under a lock (or some other ownership rule) is
> > generally much more solid. But also much harder to design correctly
> > :-/
> >
> The use we are discussing here is to keep the entity alive as long as
> jobs from that entity are still active on the HW. While there are no
> hard bounds on when a job will get inactive, at least it's not
> unbounded. On a crash/fault the job will be removed from the hardware
> pretty soon.
>
> Well behaved jobs after a application shutdown might take a little
> longer, but I don't really see the new problem with keeping the entity
> alive? As long as a job is active on the hardware, we can't throw out
> the VM or BOs, no difference whether the entity is kept alive or not.
>
> Some hardware might have ways to expedite job inactivation by
> deactivating scheduling queues, or just taking a fault, but for some HW
> we'll just have to wait for the job to finish.

Shouldn't the scheduler's timed_out/tdr logic take care of these? It's
probably not good to block in something like the close(drmfd) or process
exit() for these, but it's all dma_fence underneath and those _must_
singal in finite time no matter what. So shouldn't be a deadlock problem,
but might still be a "userspace really doesn't like a big stall there"
problem.
-Daniel

> Regards,
> Lucas
>
> > > Additional to that we have some ugly inter dependencies between tearing
> > > down an application (potential with a KILL signal from the OOM killer)
> > > and backward compatibility for some applications which render something
> > > and quit before the rendering is completed in the hardware.
> >
> > Yeah I think that part would also be good to sort out once&for all in
> > drm/sched, because i915 has/had the same struggle.
> > -Daniel
> >
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > Regards,
> > > > Luben
> > > >
> > > > > Christian mentioned amdgpu tried something similar to what Lucas tried
> > > > > running into similar trouble, backed off and implemented it in another
> > > > > way - a driver specific way I guess?
> > > > >
> > > > > > Lucas, can you shed some light on,
> > > > > >
> > > > > > 1. In what way the current FIFO scheduling is unfair, and
> > > > > > 2. shed some details on this "scheduling algorithm with a bit
> > > > > > more fairness than the current one"?
> > > > > >
> > > > > > Regards,
> > > > > > Luben
> > > > > >
> > > > > > > > 2. Somehow make sure drm_sched_entity_destroy() does block until all
> > > > > > > > jobs deployed through this entity were fetched from the schedulers
> > > > > > > > pending list. Though, I'm pretty sure that this is not really desirable.
> > > > > > > >
> > > > > > > > 3. Just revert the change and let drivers implement tracking of GPU
> > > > > > > > active times themselves.
> > > > > > > >
> > > > > > > Given that we are already pretty late in the release cycle and etnaviv
> > > > > > > being the only driver so far making use of the scheduler elapsed time
> > > > > > > tracking I think the right short term solution is to either move the
> > > > > > > tracking into etnaviv or just revert the change for now. I'll have a
> > > > > > > look at this.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Lucas
> > > > > > >
> > > > > > > > In the case of just reverting the change I'd propose to also set a jobs
> > > > > > > > entity pointer to NULL once the job was taken from the entity, such
> > > > > > > > that in case of a future issue we fail where the actual issue resides
> > > > > > > > and to make it more obvious that the field shouldn't be used anymore
> > > > > > > > after the job was taken from the entity.
> > > > > > > >
> > > > > > > > I'm happy to implement the solution we agree on. However, it might also
> > > > > > > > make sense to revert the change until we have a solution in place. I'm
> > > > > > > > also happy to send a revert with a proper description of the problem.
> > > > > > > > Please let me know what you think.
> > > > > > > >
> > > > > > > > - Danilo
> > > > > > > >
> > >
> >
> >
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 12:39:55

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 06/04/2023 19.09, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
>> On 06/04/2023 17.27, Daniel Vetter wrote:
>>> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>>>
>>>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>>>> Hi Danilo,
>>>>>>>>
>>>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>>>
>>>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>>>> once the job is fetched from the pending_list.
>>>>>>>>>
>>>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>>>> a UAF.
>>>>>>>>>
>>>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>>>
>>>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>>>
>>>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>>>> a fix directly.
>>>>>>>>>
>>>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>>>
>>>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>>>
>>>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>>>> more fairness than the current one sometime in the future, which
>>>>>>>> requires execution time tracking on the entities.
>>>>>>> Danilo,
>>>>>>>
>>>>>>> Using kref is preferable, i.e. option 1 above.
>>>>>> I think the only real motivation for doing that would be for generically
>>>>>> tracking job statistics within the entity a job was deployed through. If
>>>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>>>> patch for this option and drop this one:
>>>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>>>> The reason kref is attractive is because one doesn't need to worry about
>>>>> it--when the last user drops the kref, the release is called to do
>>>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>>>
>>>> Yeah, reference counting unfortunately have some traps as well. For
>>>> example rarely dropping the last reference from interrupt context or
>>>> with some unexpected locks help when the cleanup function doesn't expect
>>>> that is a good recipe for problems as well.
>>>>
>>>>> Regarding the patch above--I did look around the code, and it seems safe,
>>>>> as per your analysis, I didn't see any reference to entity after job submission,
>>>>> but I'll comment on that thread as well for the record.
>>>>
>>>> Reference counting the entities was suggested before. The intentionally
>>>> avoided that so far because the entity might be the tip of the iceberg
>>>> of stuff you need to keep around.
>>>>
>>>> For example for command submission you also need the VM and when you
>>>> keep the VM alive you also need to keep the file private alive....
>>>
>>> Yeah refcounting looks often like the easy way out to avoid
>>> use-after-free issue, until you realize you've just made lifetimes
>>> unbounded and have some enourmous leaks: entity keeps vm alive, vm
>>> keeps all the bo alives, somehow every crash wastes more memory
>>> because vk_device_lost means userspace allocates new stuff for
>>> everything.
>>
>> Refcounting everywhere has been working well for us, so well that so far all
>> the oopses we've hit have been... drm_sched bugs like this one, not anything
>> in the driver. But at least in Rust you have the advantage that you can't
>> just forget a decref in a rarely-hit error path (or worse, forget an incref
>> somewhere important)... ^^
>>
>>> If possible a lifetime design where lifetimes have hard bounds and you
>>> just borrow a reference under a lock (or some other ownership rule) is
>>> generally much more solid. But also much harder to design correctly
>>> :-/
>>>
>>>> Additional to that we have some ugly inter dependencies between tearing
>>>> down an application (potential with a KILL signal from the OOM killer)
>>>> and backward compatibility for some applications which render something
>>>> and quit before the rendering is completed in the hardware.
>>>
>>> Yeah I think that part would also be good to sort out once&for all in
>>> drm/sched, because i915 has/had the same struggle.
>>> -Daniel
>>>
>>
>> Is this really a thing? I think that's never going to work well for explicit
>> sync, since the kernel doesn't even know what BOs it has to keep alive for a
>> job... I guess it could keep the entire file and all of its objects/VMs/etc
>> alive until all of its submissions complete but... ewww.
>>
>> Our Mesa implementation synchronously waits for all jobs on context destroy
>> for this reason, but if you just kill the app, yeah, you get faults as
>> running GPU jobs have BOs yanked out from under them. Kill loops make for a
>> good way of testing fault handling...
>
> You wind down the entire thing on file close? Like
> - stop all context
> - tear down all context
> - tear down all vm
> - tear down all obj
>
> Just winding things down in a random order and then letting gpu fault
> handling sort out the mess doesn't strike me as particularly clean design
> ...
The idea is that object drop order generally doesn't matter since things
that care about other things should own them or hold references to them
anyway, so the dependency graph of all the resources is encoded directly
in the type hierarchy instead of having to open-code a "cleanup
procedure"... which then invariably leads to corner cases when you have
to do the same thing, or part of it, for error handling.

This has been working *very* well! It solves the issue of error handling
since error handling just unwinds whatever was done to that point
naturally in Rust (? operator), so there's practically no open-coded
error handling code anywhere. The first time we ran into OOMs (Xonotic
with no Mesa texture compression support yet, on 8GB machines on max
settings...) the whole thing just worked. OOM killer went rampant and
shmem doesn't account stuff to processes properly of course, but all the
error paths, allocation errors, etc... all of that just worked, first
try, across dozens of error paths that had never been tested before, not
a single oops or deadlock or leak or anything in sight. Similarly,
yesterday I did manage to run into drm_sched failing to create kthreads
(the scaling issue Matthew's series fixes)... and still, that was fine.
That happens on queue creation so it just bubbled up to Mesa as a failed
ioctl and things kept moving along nicely otherwise. I even have nice
ergonomic XArray semantics so that you can reserve a new slot, allocate
some object, then populate it, and if you don't (because you errored out
in between) it automatically gets freed again without explicit cleanup code.

And it also means that I can encode *firmware* resource dependencies in
the type system (with Rust lifetimes attached to *GPU* pointers even -
it's a bit dodgy how it's done but it works well in practice). Since it
is absolutely critical that the firmware objects respect their lifetimes
or else the whole thing crashes irrecoverably, this is the only way I
feel it's been even practical to write this driver and not be a firmware
crash mess. Of course we still get some crashes due to flaws in how I
understand the firmware, but it's always things I don't know, not things
I accidentally messed up in some corner case code path we don't normally
hit, since I just don't have to think about that as long as the
hierarchy is right.

I actually don't know exactly what precise order things get dropped in
for this reason! I could find out, and it's predictable in Rust, what I
mean is that thinking about a total order like that is not necessary for
correctness as long as I got the ownership right. Off the top of my head
though, it goes very roughly like this:

- On File close, all the GEM objects get closed (DRM core does this)
- This triggers explicit drops of all mappings in those GEM objects
owned by that File (identified by unique ID, this is the one annoying
circular reference thing I mentioned in the other thread...). At this
point the GPU probably faults but we don't care. *
- The File itself gets dropped, which drops the XArrays for queues and
(UAPI) VMs
- UAPI VMs getting dropped doesn't do much other than unmap a single
dummy object. The underlying MMU VM is refcounted and jobs hold
references. This also drops the userspace VM object allocator used for
kernel-managed allocations, but that too is internally refcounted and
won't go away until all of its allocations do.
- Queues get dropped, which mostly drops a bunch of references to things
that no longer matter, along with the scheduler and scheduler entity.
- The entity already has a reference to the scheduler in the abstraction
(to meet the soundness requirement), so the entity necessarily goes
first. That kills all not yet scheduled jobs, freeing any resources they
might use.
- Then the scheduler gets torn down, and with my other patch that
logically kills all in-flight jobs, detaching their hardware fences and
dropping the job objects. This... still doesn't do much other than drop
some references that we don't care about.
- At this point, if any jobs are in flight, their firmware objects and
all of the type hierarchy that goes with them is still alive, as well as
the firmware queues they're in and the Rust objects representing them,
the VMs they use, the Events they have been allocated...
- Jobs complete (successfully or fault), then when complete get popped
off of the Rust-side queue objects that represent the firmware queues.
- When any given FW queue is empty, it relinquishes its assigned
firmware event ID. That causes the event system to drop its owner
reference to it, which means the queue itself gets dropped (since the
UAPI Queue that also held a reference is gone). That then also drops a
reference to what I call the GpuContext.
- To avoid deadlocks, completed job objects are freed in another thread
(ugly hack right now, should be done better in the future). Eventually
as that happens, any resources they reference are dropped, including
some shared ones that are logically tied to the scheduler/queues,
references to the MMU VM address space, references to the VM slot that
address space is assigned to, objects allocated out of user VM space,
everything. Refcounts everywhere for anything shared, direct ownership
of child structures for anything that isn't (work buffers, firmware
command lists, etc.). I once tried to make a slide of the references and
pointers involved in just the vertex half of a single GPU job and...
even just that was quite interesting.
- When the last job completes, that drops the last reference to the VM
slot, which means the backing VM is logically detached from the GPU MMU
(this is lazy though, it's still there in practice).
- When the last firmware queue is dropped for a logical queue/sched/etc
(which means no more jobs are running at the GPU for that context), that
drops the last reference to the GpuContext. That also gets shoved into
another thread context for cleanup to avoid deadlocks with fault recovery.
- When that is finally cleaned up, a firmware command is sent to
invalidate the GpuContext. I'm still figuring out what that does and
what the lifetime rules are here (this is the only explicit invalidation
command that exists), but as of yesterday I know that at the very least
we need to keep hold of any Tiled Vertex Buffer associated with it until
after inval, so that now has a reference to it that gets dropped after
the firmware acknowledges the invalidation (unless it's a
non-render-capable Queue, then no TVB necessary).
- When the Buffer gets dropped, that frees both its backing memory and
(virtual) page list structures, which are in user VM space, as well as
some kernel firmware objects.
- If things have happened in the order I listed here, those will be the
last allocations in the two user VM space heap object allocators, so
those now get dropped, which drops the mappings of their backing GEM
objects, unmapping them from the MMU VM page tables.
- Those mappings will now be the last references to the actual MMU VM
object, so that it gets destroyed (the lazy detach comes into effect
here, PT base address is removed from the VM context base table, full
ASID invalidate, etc.), which with it drops the IoPgTable that backs it,
which frees the page tables.
- Then finally the GEM objects backing the userspace allocators get
dropped as well, which will be the last reference to them, so those get
freed.

I probably got more than one thing wrong there, and there's layers of
complexity I glossed over, but that's the rough idea ^^

* If we need to fix this then we're going to need some kind of signal
from the DRM core that this is happening and it's not normal
user-triggered GEM closing, and it's going to be interesting... it also
means we need some kind of mechanism to transfer responsibility over
those mappings to all in-flight jobs themselves, because normally
userspace is strictly responsible for all mappings in an explicit sync
VM bind style world, and now we're adding a special case where we freeze
the state of the VM until all in-flight jobs are done when the File goes
away instead of eagerly freeing things. That's a very weird departure
from how the whole thing normally works, so if we really want that I'm
going to have to think of how to do it reasonably. It might be easier
once we implement the full VM map range tracking which will likely flip
the VM<->GEM object relationship around a bit.

~~ Lina

2023-04-06 12:44:35

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am Donnerstag, dem 06.04.2023 um 14:09 +0200 schrieb Daniel Vetter:
> On Thu, Apr 06, 2023 at 12:45:12PM +0200, Lucas Stach wrote:
> > Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > > On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
> > > >
> > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > Hi Danilo,
> > > > > > > >
> > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > >
> > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > >
> > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > a UAF.
> > > > > > > > >
> > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > >
> > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > >
> > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > a fix directly.
> > > > > > > > >
> > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > >
> > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > >
> > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > requires execution time tracking on the entities.
> > > > > > > Danilo,
> > > > > > >
> > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > I think the only real motivation for doing that would be for generically
> > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > patch for this option and drop this one:
> > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > it--when the last user drops the kref, the release is called to do
> > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > >
> > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > example rarely dropping the last reference from interrupt context or
> > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > that is a good recipe for problems as well.
> > > >
> > Fully agreed.
> >
> > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > but I'll comment on that thread as well for the record.
> > > >
> > > > Reference counting the entities was suggested before. The intentionally
> > > > avoided that so far because the entity might be the tip of the iceberg
> > > > of stuff you need to keep around.
> > > >
> > > > For example for command submission you also need the VM and when you
> > > > keep the VM alive you also need to keep the file private alive....
> > >
> > > Yeah refcounting looks often like the easy way out to avoid
> > > use-after-free issue, until you realize you've just made lifetimes
> > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > keeps all the bo alives, somehow every crash wastes more memory
> > > because vk_device_lost means userspace allocates new stuff for
> > > everything.
> > >
> > > If possible a lifetime design where lifetimes have hard bounds and you
> > > just borrow a reference under a lock (or some other ownership rule) is
> > > generally much more solid. But also much harder to design correctly
> > > :-/
> > >
> > The use we are discussing here is to keep the entity alive as long as
> > jobs from that entity are still active on the HW. While there are no
> > hard bounds on when a job will get inactive, at least it's not
> > unbounded. On a crash/fault the job will be removed from the hardware
> > pretty soon.
> >
> > Well behaved jobs after a application shutdown might take a little
> > longer, but I don't really see the new problem with keeping the entity
> > alive? As long as a job is active on the hardware, we can't throw out
> > the VM or BOs, no difference whether the entity is kept alive or not.
> >
> > Some hardware might have ways to expedite job inactivation by
> > deactivating scheduling queues, or just taking a fault, but for some HW
> > we'll just have to wait for the job to finish.
>
> Shouldn't the scheduler's timed_out/tdr logic take care of these? It's
> probably not good to block in something like the close(drmfd) or process
> exit() for these, but it's all dma_fence underneath and those _must_
> singal in finite time no matter what. So shouldn't be a deadlock problem,
> but might still be a "userspace really doesn't like a big stall there"
> problem.

I'm not sure if we are talking past each other here. I don't really
follow where you see the problem here?

If the hardware works as expected and the job is behaving well, it will
finish in finite time when the HW is done with the job. When the job is
bad and crashes the HW, sure it will be shot down by the timeout
handling. Both cases will signal the fences and clean up resources
eventually.

Keeping the scheduler entity alive is really orthogonal to that. If the
entity is kept alive until the job is cleaned up we could potentially
add more common state, like the GPU time tracking, to the entity
without the risk of use after free.

Regards,
Lucas

2023-04-06 13:39:17

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 09:21:47PM +0900, Asahi Lina wrote:
> On 06/04/2023 19.09, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
> > > On 06/04/2023 17.27, Daniel Vetter wrote:
> > > > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > > > >
> > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > Hi Danilo,
> > > > > > > > >
> > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > >
> > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > >
> > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > a UAF.
> > > > > > > > > >
> > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > >
> > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > >
> > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > a fix directly.
> > > > > > > > > >
> > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > >
> > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > >
> > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > Danilo,
> > > > > > > >
> > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > patch for this option and drop this one:
> > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > >
> > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > example rarely dropping the last reference from interrupt context or
> > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > that is a good recipe for problems as well.
> > > > >
> > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > but I'll comment on that thread as well for the record.
> > > > >
> > > > > Reference counting the entities was suggested before. The intentionally
> > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > of stuff you need to keep around.
> > > > >
> > > > > For example for command submission you also need the VM and when you
> > > > > keep the VM alive you also need to keep the file private alive....
> > > >
> > > > Yeah refcounting looks often like the easy way out to avoid
> > > > use-after-free issue, until you realize you've just made lifetimes
> > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > because vk_device_lost means userspace allocates new stuff for
> > > > everything.
> > >
> > > Refcounting everywhere has been working well for us, so well that so far all
> > > the oopses we've hit have been... drm_sched bugs like this one, not anything
> > > in the driver. But at least in Rust you have the advantage that you can't
> > > just forget a decref in a rarely-hit error path (or worse, forget an incref
> > > somewhere important)... ^^
> > >
> > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > generally much more solid. But also much harder to design correctly
> > > > :-/
> > > >
> > > > > Additional to that we have some ugly inter dependencies between tearing
> > > > > down an application (potential with a KILL signal from the OOM killer)
> > > > > and backward compatibility for some applications which render something
> > > > > and quit before the rendering is completed in the hardware.
> > > >
> > > > Yeah I think that part would also be good to sort out once&for all in
> > > > drm/sched, because i915 has/had the same struggle.
> > > > -Daniel
> > > >
> > >
> > > Is this really a thing? I think that's never going to work well for explicit
> > > sync, since the kernel doesn't even know what BOs it has to keep alive for a
> > > job... I guess it could keep the entire file and all of its objects/VMs/etc
> > > alive until all of its submissions complete but... ewww.
> > >
> > > Our Mesa implementation synchronously waits for all jobs on context destroy
> > > for this reason, but if you just kill the app, yeah, you get faults as
> > > running GPU jobs have BOs yanked out from under them. Kill loops make for a
> > > good way of testing fault handling...
> >
> > You wind down the entire thing on file close? Like
> > - stop all context
> > - tear down all context
> > - tear down all vm
> > - tear down all obj
> >
> > Just winding things down in a random order and then letting gpu fault
> > handling sort out the mess doesn't strike me as particularly clean design
> > ...
> The idea is that object drop order generally doesn't matter since things
> that care about other things should own them or hold references to them
> anyway, so the dependency graph of all the resources is encoded directly in
> the type hierarchy instead of having to open-code a "cleanup procedure"...
> which then invariably leads to corner cases when you have to do the same
> thing, or part of it, for error handling.
>
> This has been working *very* well! It solves the issue of error handling
> since error handling just unwinds whatever was done to that point naturally
> in Rust (? operator), so there's practically no open-coded error handling
> code anywhere. The first time we ran into OOMs (Xonotic with no Mesa texture
> compression support yet, on 8GB machines on max settings...) the whole thing
> just worked. OOM killer went rampant and shmem doesn't account stuff to
> processes properly of course, but all the error paths, allocation errors,
> etc... all of that just worked, first try, across dozens of error paths that
> had never been tested before, not a single oops or deadlock or leak or
> anything in sight. Similarly, yesterday I did manage to run into drm_sched
> failing to create kthreads (the scaling issue Matthew's series fixes)... and
> still, that was fine. That happens on queue creation so it just bubbled up
> to Mesa as a failed ioctl and things kept moving along nicely otherwise. I
> even have nice ergonomic XArray semantics so that you can reserve a new
> slot, allocate some object, then populate it, and if you don't (because you
> errored out in between) it automatically gets freed again without explicit
> cleanup code.
>
> And it also means that I can encode *firmware* resource dependencies in the
> type system (with Rust lifetimes attached to *GPU* pointers even - it's a
> bit dodgy how it's done but it works well in practice). Since it is
> absolutely critical that the firmware objects respect their lifetimes or
> else the whole thing crashes irrecoverably, this is the only way I feel it's
> been even practical to write this driver and not be a firmware crash mess.
> Of course we still get some crashes due to flaws in how I understand the
> firmware, but it's always things I don't know, not things I accidentally
> messed up in some corner case code path we don't normally hit, since I just
> don't have to think about that as long as the hierarchy is right.
>
> I actually don't know exactly what precise order things get dropped in for
> this reason! I could find out, and it's predictable in Rust, what I mean is
> that thinking about a total order like that is not necessary for correctness
> as long as I got the ownership right. Off the top of my head though, it goes
> very roughly like this:
>
> - On File close, all the GEM objects get closed (DRM core does this)
> - This triggers explicit drops of all mappings in those GEM objects owned by
> that File (identified by unique ID, this is the one annoying circular
> reference thing I mentioned in the other thread...). At this point the GPU
> probably faults but we don't care. *
> - The File itself gets dropped, which drops the XArrays for queues and
> (UAPI) VMs
> - UAPI VMs getting dropped doesn't do much other than unmap a single dummy
> object. The underlying MMU VM is refcounted and jobs hold references. This
> also drops the userspace VM object allocator used for kernel-managed
> allocations, but that too is internally refcounted and won't go away until
> all of its allocations do.
> - Queues get dropped, which mostly drops a bunch of references to things
> that no longer matter, along with the scheduler and scheduler entity.
> - The entity already has a reference to the scheduler in the abstraction (to
> meet the soundness requirement), so the entity necessarily goes first. That
> kills all not yet scheduled jobs, freeing any resources they might use.
> - Then the scheduler gets torn down, and with my other patch that logically
> kills all in-flight jobs, detaching their hardware fences and dropping the
> job objects. This... still doesn't do much other than drop some references
> that we don't care about.
> - At this point, if any jobs are in flight, their firmware objects and all
> of the type hierarchy that goes with them is still alive, as well as the
> firmware queues they're in and the Rust objects representing them, the VMs
> they use, the Events they have been allocated...
> - Jobs complete (successfully or fault), then when complete get popped off
> of the Rust-side queue objects that represent the firmware queues.
> - When any given FW queue is empty, it relinquishes its assigned firmware
> event ID. That causes the event system to drop its owner reference to it,
> which means the queue itself gets dropped (since the UAPI Queue that also
> held a reference is gone). That then also drops a reference to what I call
> the GpuContext.
> - To avoid deadlocks, completed job objects are freed in another thread
> (ugly hack right now, should be done better in the future). Eventually as
> that happens, any resources they reference are dropped, including some
> shared ones that are logically tied to the scheduler/queues, references to
> the MMU VM address space, references to the VM slot that address space is
> assigned to, objects allocated out of user VM space, everything. Refcounts
> everywhere for anything shared, direct ownership of child structures for
> anything that isn't (work buffers, firmware command lists, etc.). I once
> tried to make a slide of the references and pointers involved in just the
> vertex half of a single GPU job and... even just that was quite interesting.
> - When the last job completes, that drops the last reference to the VM slot,
> which means the backing VM is logically detached from the GPU MMU (this is
> lazy though, it's still there in practice).
> - When the last firmware queue is dropped for a logical queue/sched/etc
> (which means no more jobs are running at the GPU for that context), that
> drops the last reference to the GpuContext. That also gets shoved into
> another thread context for cleanup to avoid deadlocks with fault recovery.
> - When that is finally cleaned up, a firmware command is sent to invalidate
> the GpuContext. I'm still figuring out what that does and what the lifetime
> rules are here (this is the only explicit invalidation command that exists),
> but as of yesterday I know that at the very least we need to keep hold of
> any Tiled Vertex Buffer associated with it until after inval, so that now
> has a reference to it that gets dropped after the firmware acknowledges the
> invalidation (unless it's a non-render-capable Queue, then no TVB
> necessary).
> - When the Buffer gets dropped, that frees both its backing memory and
> (virtual) page list structures, which are in user VM space, as well as some
> kernel firmware objects.
> - If things have happened in the order I listed here, those will be the last
> allocations in the two user VM space heap object allocators, so those now
> get dropped, which drops the mappings of their backing GEM objects,
> unmapping them from the MMU VM page tables.
> - Those mappings will now be the last references to the actual MMU VM
> object, so that it gets destroyed (the lazy detach comes into effect here,
> PT base address is removed from the VM context base table, full ASID
> invalidate, etc.), which with it drops the IoPgTable that backs it, which
> frees the page tables.
> - Then finally the GEM objects backing the userspace allocators get dropped
> as well, which will be the last reference to them, so those get freed.

Thanks for the write up!

Maybe some more fundamental thoughts on this entire endeavour:

I think one thing that's causing a lot of friction that in C drivers at
least some of these things are done with implied/borrowed references. If
you want an absolute shocking example, the gpu crash dump sampling for a
driver that does dynamic memory management can be pretty epic:

- the timeout handler is guaranteed (well if the driver didn't screw up
things, which mostly boils down to call drm_sched_stop() before it
analyzes anything at all wrt gpu hw state) to hold up the drm_job
completion fence

- this fence is (at submit ioctl time) installed into all the dma_resv
fence containers for the vm (single shared dma_resv for all vm-private obj for
efficiency, all shareable gem_bo have their own dma_resv)

- the memory manager guarantees that it will not free (or even move in the
case of ttm) the buffer while a fence is unsingalled. ttm does this by
delaying the free as needed, i915-gem and some of the others did this by
trying to hold onto extra references. the latter causes way too much
problems with reference dropping in my experience looking at a decade of
drivers getting details wrong here.

- which means the crash dump handler can have a plain list of gem_bo to
record, without any references, because they wont go away before it
finishes.

- more fun even, you could directly sample the memory locations at ioctl
submit time, and even when ttm moves the buffers around in a pipelined
fashion: your timeout handler wont run before all the jobs to move the
buffer into the right location have completed, and it will not unblock
any subsequent move or eviction that's already queued up. Which means
even the raw memory address will point at the right stuff.

This is just one example. drm and the kernel is full of these borrowed
reference tricks with often cross through the entire stack, so rust has to
be able to cope. The sales pitch of rust, and really the reason it's the
first non-C language in linux, is that with the borrow semantics, it can
validate this stuff at compile time, and allow kernel drivers to continue
playing these tricks to outright avoid any and all refcounting needs. If
rust doesn't deliver and we need to have all the refcounting again to make
rust drivers safe I think that'll cast serious doubts on the viability of
rust-in-linux.

Now there's definitely going to be hilarious bugs uncovered on the C side,
and semantics that need to be clarified, but I think enabling scary tricks
like the above one is what'll fundamentally make or break rust in linux.

In the end it'll be lots of detail work, and I really hope it all works
out.

> I probably got more than one thing wrong there, and there's layers of
> complexity I glossed over, but that's the rough idea ^^
>
> * If we need to fix this then we're going to need some kind of signal from
> the DRM core that this is happening and it's not normal user-triggered GEM
> closing, and it's going to be interesting... it also means we need some kind
> of mechanism to transfer responsibility over those mappings to all in-flight
> jobs themselves, because normally userspace is strictly responsible for all
> mappings in an explicit sync VM bind style world, and now we're adding a
> special case where we freeze the state of the VM until all in-flight jobs
> are done when the File goes away instead of eagerly freeing things. That's a
> very weird departure from how the whole thing normally works, so if we
> really want that I'm going to have to think of how to do it reasonably. It
> might be easier once we implement the full VM map range tracking which will
> likely flip the VM<->GEM object relationship around a bit.

See my other reply for you how can untangle this potentially on the vm_id
subthread. I think in general rust Drop is awesome, but there's cases
where we'll have to be more explicit at unwinding things. At least in the
linux kernel the general consensus is that too much refcounting is a
maintenance disaster because you end up dropping the final reference in
all kinds of really bad places (interrupt handlers, while holding the
wrong locks, holding them too long). Somewhat related I also expect that
rust drivers will need to have quite a few manual drop calls to
artificially limit lifetime, because sometimes when you drop things the
wrong way round (e.g. drop the refcount while still having the mutex guard
live) results in deadlocks.

There's going to be lots of fun here :-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 13:58:04

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 02:19:20PM +0200, Lucas Stach wrote:
> Am Donnerstag, dem 06.04.2023 um 14:09 +0200 schrieb Daniel Vetter:
> > On Thu, Apr 06, 2023 at 12:45:12PM +0200, Lucas Stach wrote:
> > > Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > > > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > > > >
> > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > Hi Danilo,
> > > > > > > > >
> > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > >
> > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > >
> > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > a UAF.
> > > > > > > > > >
> > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > >
> > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > >
> > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > a fix directly.
> > > > > > > > > >
> > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > >
> > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > >
> > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > Danilo,
> > > > > > > >
> > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > patch for this option and drop this one:
> > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > >
> > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > example rarely dropping the last reference from interrupt context or
> > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > that is a good recipe for problems as well.
> > > > >
> > > Fully agreed.
> > >
> > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > but I'll comment on that thread as well for the record.
> > > > >
> > > > > Reference counting the entities was suggested before. The intentionally
> > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > of stuff you need to keep around.
> > > > >
> > > > > For example for command submission you also need the VM and when you
> > > > > keep the VM alive you also need to keep the file private alive....
> > > >
> > > > Yeah refcounting looks often like the easy way out to avoid
> > > > use-after-free issue, until you realize you've just made lifetimes
> > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > because vk_device_lost means userspace allocates new stuff for
> > > > everything.
> > > >
> > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > generally much more solid. But also much harder to design correctly
> > > > :-/
> > > >
> > > The use we are discussing here is to keep the entity alive as long as
> > > jobs from that entity are still active on the HW. While there are no
> > > hard bounds on when a job will get inactive, at least it's not
> > > unbounded. On a crash/fault the job will be removed from the hardware
> > > pretty soon.
> > >
> > > Well behaved jobs after a application shutdown might take a little
> > > longer, but I don't really see the new problem with keeping the entity
> > > alive? As long as a job is active on the hardware, we can't throw out
> > > the VM or BOs, no difference whether the entity is kept alive or not.
> > >
> > > Some hardware might have ways to expedite job inactivation by
> > > deactivating scheduling queues, or just taking a fault, but for some HW
> > > we'll just have to wait for the job to finish.
> >
> > Shouldn't the scheduler's timed_out/tdr logic take care of these? It's
> > probably not good to block in something like the close(drmfd) or process
> > exit() for these, but it's all dma_fence underneath and those _must_
> > singal in finite time no matter what. So shouldn't be a deadlock problem,
> > but might still be a "userspace really doesn't like a big stall there"
> > problem.
>
> I'm not sure if we are talking past each other here. I don't really
> follow where you see the problem here?
>
> If the hardware works as expected and the job is behaving well, it will
> finish in finite time when the HW is done with the job. When the job is
> bad and crashes the HW, sure it will be shot down by the timeout
> handling. Both cases will signal the fences and clean up resources
> eventually.
>
> Keeping the scheduler entity alive is really orthogonal to that. If the
> entity is kept alive until the job is cleaned up we could potentially
> add more common state, like the GPU time tracking, to the entity
> without the risk of use after free.

I think we're both saying the same thing, just gotten a bit confused with
phrasing things ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 14:31:10

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am 06.04.23 um 12:45 schrieb Lucas Stach:
> Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
>> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>>> Hi Danilo,
>>>>>>>
>>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>>
>>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>>> once the job is fetched from the pending_list.
>>>>>>>>
>>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>>> a UAF.
>>>>>>>>
>>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>>
>>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>>
>>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>>> a fix directly.
>>>>>>>>
>>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>>
>>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>>
>>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>>> more fairness than the current one sometime in the future, which
>>>>>>> requires execution time tracking on the entities.
>>>>>> Danilo,
>>>>>>
>>>>>> Using kref is preferable, i.e. option 1 above.
>>>>> I think the only real motivation for doing that would be for generically
>>>>> tracking job statistics within the entity a job was deployed through. If
>>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>>> patch for this option and drop this one:
>>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>>> The reason kref is attractive is because one doesn't need to worry about
>>>> it--when the last user drops the kref, the release is called to do
>>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>> Yeah, reference counting unfortunately have some traps as well. For
>>> example rarely dropping the last reference from interrupt context or
>>> with some unexpected locks help when the cleanup function doesn't expect
>>> that is a good recipe for problems as well.
>>>
> Fully agreed.
>
>>>> Regarding the patch above--I did look around the code, and it seems safe,
>>>> as per your analysis, I didn't see any reference to entity after job submission,
>>>> but I'll comment on that thread as well for the record.
>>> Reference counting the entities was suggested before. The intentionally
>>> avoided that so far because the entity might be the tip of the iceberg
>>> of stuff you need to keep around.
>>>
>>> For example for command submission you also need the VM and when you
>>> keep the VM alive you also need to keep the file private alive....
>> Yeah refcounting looks often like the easy way out to avoid
>> use-after-free issue, until you realize you've just made lifetimes
>> unbounded and have some enourmous leaks: entity keeps vm alive, vm
>> keeps all the bo alives, somehow every crash wastes more memory
>> because vk_device_lost means userspace allocates new stuff for
>> everything.
>>
>> If possible a lifetime design where lifetimes have hard bounds and you
>> just borrow a reference under a lock (or some other ownership rule) is
>> generally much more solid. But also much harder to design correctly
>> :-/
>>
> The use we are discussing here is to keep the entity alive as long as
> jobs from that entity are still active on the HW. While there are no
> hard bounds on when a job will get inactive, at least it's not
> unbounded. On a crash/fault the job will be removed from the hardware
> pretty soon.
>
> Well behaved jobs after a application shutdown might take a little
> longer, but I don't really see the new problem with keeping the entity
> alive? As long as a job is active on the hardware, we can't throw out
> the VM or BOs, no difference whether the entity is kept alive or not.

Exactly that's the problem. VM & BOs are dropped as soon as the process
is destroyed, we *don't* wait for the hw to finish before doing so.

Just the backing store managed by all the house keeping objects isn't
freed until the hw is idle preventing a crash or accessing freed memory.

This behavior is rather important for the OOM killer since we need to be
able to tear down the process as fast as possible in that case.

Changing that is possible, but that's quite a huge change I'm not really
willing to do just for tracking the time spend.

What we could do is to track the unsignaled fences in each entity
similar to what amdgpu is doing.

Regards,
Christian.

>
> Some hardware might have ways to expedite job inactivation by
> deactivating scheduling queues, or just taking a fault, but for some HW
> we'll just have to wait for the job to finish.
>
> Regards,
> Lucas
>

2023-04-06 14:47:19

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 06/04/2023 22.37, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 09:21:47PM +0900, Asahi Lina wrote:
>> On 06/04/2023 19.09, Daniel Vetter wrote:
>>> On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
>>>> On 06/04/2023 17.27, Daniel Vetter wrote:
>>>>> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>>>>>
>>>>>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>>>>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>>>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>>>>>> Hi Danilo,
>>>>>>>>>>
>>>>>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>>>>>
>>>>>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>>>>>> once the job is fetched from the pending_list.
>>>>>>>>>>>
>>>>>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>>>>>> a UAF.
>>>>>>>>>>>
>>>>>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>>>>>
>>>>>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>>>>>
>>>>>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>>>>>> a fix directly.
>>>>>>>>>>>
>>>>>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>>>>>
>>>>>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>>>>>
>>>>>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>>>>>> more fairness than the current one sometime in the future, which
>>>>>>>>>> requires execution time tracking on the entities.
>>>>>>>>> Danilo,
>>>>>>>>>
>>>>>>>>> Using kref is preferable, i.e. option 1 above.
>>>>>>>> I think the only real motivation for doing that would be for generically
>>>>>>>> tracking job statistics within the entity a job was deployed through. If
>>>>>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>>>>>> patch for this option and drop this one:
>>>>>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>>>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>>>>>> The reason kref is attractive is because one doesn't need to worry about
>>>>>>> it--when the last user drops the kref, the release is called to do
>>>>>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>>>>>
>>>>>> Yeah, reference counting unfortunately have some traps as well. For
>>>>>> example rarely dropping the last reference from interrupt context or
>>>>>> with some unexpected locks help when the cleanup function doesn't expect
>>>>>> that is a good recipe for problems as well.
>>>>>>
>>>>>>> Regarding the patch above--I did look around the code, and it seems safe,
>>>>>>> as per your analysis, I didn't see any reference to entity after job submission,
>>>>>>> but I'll comment on that thread as well for the record.
>>>>>>
>>>>>> Reference counting the entities was suggested before. The intentionally
>>>>>> avoided that so far because the entity might be the tip of the iceberg
>>>>>> of stuff you need to keep around.
>>>>>>
>>>>>> For example for command submission you also need the VM and when you
>>>>>> keep the VM alive you also need to keep the file private alive....
>>>>>
>>>>> Yeah refcounting looks often like the easy way out to avoid
>>>>> use-after-free issue, until you realize you've just made lifetimes
>>>>> unbounded and have some enourmous leaks: entity keeps vm alive, vm
>>>>> keeps all the bo alives, somehow every crash wastes more memory
>>>>> because vk_device_lost means userspace allocates new stuff for
>>>>> everything.
>>>>
>>>> Refcounting everywhere has been working well for us, so well that so far all
>>>> the oopses we've hit have been... drm_sched bugs like this one, not anything
>>>> in the driver. But at least in Rust you have the advantage that you can't
>>>> just forget a decref in a rarely-hit error path (or worse, forget an incref
>>>> somewhere important)... ^^
>>>>
>>>>> If possible a lifetime design where lifetimes have hard bounds and you
>>>>> just borrow a reference under a lock (or some other ownership rule) is
>>>>> generally much more solid. But also much harder to design correctly
>>>>> :-/
>>>>>
>>>>>> Additional to that we have some ugly inter dependencies between tearing
>>>>>> down an application (potential with a KILL signal from the OOM killer)
>>>>>> and backward compatibility for some applications which render something
>>>>>> and quit before the rendering is completed in the hardware.
>>>>>
>>>>> Yeah I think that part would also be good to sort out once&for all in
>>>>> drm/sched, because i915 has/had the same struggle.
>>>>> -Daniel
>>>>>
>>>>
>>>> Is this really a thing? I think that's never going to work well for explicit
>>>> sync, since the kernel doesn't even know what BOs it has to keep alive for a
>>>> job... I guess it could keep the entire file and all of its objects/VMs/etc
>>>> alive until all of its submissions complete but... ewww.
>>>>
>>>> Our Mesa implementation synchronously waits for all jobs on context destroy
>>>> for this reason, but if you just kill the app, yeah, you get faults as
>>>> running GPU jobs have BOs yanked out from under them. Kill loops make for a
>>>> good way of testing fault handling...
>>>
>>> You wind down the entire thing on file close? Like
>>> - stop all context
>>> - tear down all context
>>> - tear down all vm
>>> - tear down all obj
>>>
>>> Just winding things down in a random order and then letting gpu fault
>>> handling sort out the mess doesn't strike me as particularly clean design
>>> ...
>> The idea is that object drop order generally doesn't matter since things
>> that care about other things should own them or hold references to them
>> anyway, so the dependency graph of all the resources is encoded directly in
>> the type hierarchy instead of having to open-code a "cleanup procedure"...
>> which then invariably leads to corner cases when you have to do the same
>> thing, or part of it, for error handling.
>>
>> This has been working *very* well! It solves the issue of error handling
>> since error handling just unwinds whatever was done to that point naturally
>> in Rust (? operator), so there's practically no open-coded error handling
>> code anywhere. The first time we ran into OOMs (Xonotic with no Mesa texture
>> compression support yet, on 8GB machines on max settings...) the whole thing
>> just worked. OOM killer went rampant and shmem doesn't account stuff to
>> processes properly of course, but all the error paths, allocation errors,
>> etc... all of that just worked, first try, across dozens of error paths that
>> had never been tested before, not a single oops or deadlock or leak or
>> anything in sight. Similarly, yesterday I did manage to run into drm_sched
>> failing to create kthreads (the scaling issue Matthew's series fixes)... and
>> still, that was fine. That happens on queue creation so it just bubbled up
>> to Mesa as a failed ioctl and things kept moving along nicely otherwise. I
>> even have nice ergonomic XArray semantics so that you can reserve a new
>> slot, allocate some object, then populate it, and if you don't (because you
>> errored out in between) it automatically gets freed again without explicit
>> cleanup code.
>>
>> And it also means that I can encode *firmware* resource dependencies in the
>> type system (with Rust lifetimes attached to *GPU* pointers even - it's a
>> bit dodgy how it's done but it works well in practice). Since it is
>> absolutely critical that the firmware objects respect their lifetimes or
>> else the whole thing crashes irrecoverably, this is the only way I feel it's
>> been even practical to write this driver and not be a firmware crash mess.
>> Of course we still get some crashes due to flaws in how I understand the
>> firmware, but it's always things I don't know, not things I accidentally
>> messed up in some corner case code path we don't normally hit, since I just
>> don't have to think about that as long as the hierarchy is right.
>>
>> I actually don't know exactly what precise order things get dropped in for
>> this reason! I could find out, and it's predictable in Rust, what I mean is
>> that thinking about a total order like that is not necessary for correctness
>> as long as I got the ownership right. Off the top of my head though, it goes
>> very roughly like this:
>>
>> - On File close, all the GEM objects get closed (DRM core does this)
>> - This triggers explicit drops of all mappings in those GEM objects owned by
>> that File (identified by unique ID, this is the one annoying circular
>> reference thing I mentioned in the other thread...). At this point the GPU
>> probably faults but we don't care. *
>> - The File itself gets dropped, which drops the XArrays for queues and
>> (UAPI) VMs
>> - UAPI VMs getting dropped doesn't do much other than unmap a single dummy
>> object. The underlying MMU VM is refcounted and jobs hold references. This
>> also drops the userspace VM object allocator used for kernel-managed
>> allocations, but that too is internally refcounted and won't go away until
>> all of its allocations do.
>> - Queues get dropped, which mostly drops a bunch of references to things
>> that no longer matter, along with the scheduler and scheduler entity.
>> - The entity already has a reference to the scheduler in the abstraction (to
>> meet the soundness requirement), so the entity necessarily goes first. That
>> kills all not yet scheduled jobs, freeing any resources they might use.
>> - Then the scheduler gets torn down, and with my other patch that logically
>> kills all in-flight jobs, detaching their hardware fences and dropping the
>> job objects. This... still doesn't do much other than drop some references
>> that we don't care about.
>> - At this point, if any jobs are in flight, their firmware objects and all
>> of the type hierarchy that goes with them is still alive, as well as the
>> firmware queues they're in and the Rust objects representing them, the VMs
>> they use, the Events they have been allocated...
>> - Jobs complete (successfully or fault), then when complete get popped off
>> of the Rust-side queue objects that represent the firmware queues.
>> - When any given FW queue is empty, it relinquishes its assigned firmware
>> event ID. That causes the event system to drop its owner reference to it,
>> which means the queue itself gets dropped (since the UAPI Queue that also
>> held a reference is gone). That then also drops a reference to what I call
>> the GpuContext.
>> - To avoid deadlocks, completed job objects are freed in another thread
>> (ugly hack right now, should be done better in the future). Eventually as
>> that happens, any resources they reference are dropped, including some
>> shared ones that are logically tied to the scheduler/queues, references to
>> the MMU VM address space, references to the VM slot that address space is
>> assigned to, objects allocated out of user VM space, everything. Refcounts
>> everywhere for anything shared, direct ownership of child structures for
>> anything that isn't (work buffers, firmware command lists, etc.). I once
>> tried to make a slide of the references and pointers involved in just the
>> vertex half of a single GPU job and... even just that was quite interesting.
>> - When the last job completes, that drops the last reference to the VM slot,
>> which means the backing VM is logically detached from the GPU MMU (this is
>> lazy though, it's still there in practice).
>> - When the last firmware queue is dropped for a logical queue/sched/etc
>> (which means no more jobs are running at the GPU for that context), that
>> drops the last reference to the GpuContext. That also gets shoved into
>> another thread context for cleanup to avoid deadlocks with fault recovery.
>> - When that is finally cleaned up, a firmware command is sent to invalidate
>> the GpuContext. I'm still figuring out what that does and what the lifetime
>> rules are here (this is the only explicit invalidation command that exists),
>> but as of yesterday I know that at the very least we need to keep hold of
>> any Tiled Vertex Buffer associated with it until after inval, so that now
>> has a reference to it that gets dropped after the firmware acknowledges the
>> invalidation (unless it's a non-render-capable Queue, then no TVB
>> necessary).
>> - When the Buffer gets dropped, that frees both its backing memory and
>> (virtual) page list structures, which are in user VM space, as well as some
>> kernel firmware objects.
>> - If things have happened in the order I listed here, those will be the last
>> allocations in the two user VM space heap object allocators, so those now
>> get dropped, which drops the mappings of their backing GEM objects,
>> unmapping them from the MMU VM page tables.
>> - Those mappings will now be the last references to the actual MMU VM
>> object, so that it gets destroyed (the lazy detach comes into effect here,
>> PT base address is removed from the VM context base table, full ASID
>> invalidate, etc.), which with it drops the IoPgTable that backs it, which
>> frees the page tables.
>> - Then finally the GEM objects backing the userspace allocators get dropped
>> as well, which will be the last reference to them, so those get freed.
>
> Thanks for the write up!
>
> Maybe some more fundamental thoughts on this entire endeavour:
>
> I think one thing that's causing a lot of friction that in C drivers at
> least some of these things are done with implied/borrowed references. If
> you want an absolute shocking example, the gpu crash dump sampling for a
> driver that does dynamic memory management can be pretty epic:
>
> - the timeout handler is guaranteed (well if the driver didn't screw up
> things, which mostly boils down to call drm_sched_stop() before it
> analyzes anything at all wrt gpu hw state) to hold up the drm_job
> completion fence
>
> - this fence is (at submit ioctl time) installed into all the dma_resv
> fence containers for the vm (single shared dma_resv for all vm-private obj for
> efficiency, all shareable gem_bo have their own dma_resv)
>
> - the memory manager guarantees that it will not free (or even move in the
> case of ttm) the buffer while a fence is unsingalled. ttm does this by
> delaying the free as needed, i915-gem and some of the others did this by
> trying to hold onto extra references. the latter causes way too much
> problems with reference dropping in my experience looking at a decade of
> drivers getting details wrong here.
>
> - which means the crash dump handler can have a plain list of gem_bo to
> record, without any references, because they wont go away before it
> finishes.
>
> - more fun even, you could directly sample the memory locations at ioctl
> submit time, and even when ttm moves the buffers around in a pipelined
> fashion: your timeout handler wont run before all the jobs to move the
> buffer into the right location have completed, and it will not unblock
> any subsequent move or eviction that's already queued up. Which means
> even the raw memory address will point at the right stuff.
>
> This is just one example. drm and the kernel is full of these borrowed
> reference tricks with often cross through the entire stack, so rust has to
> be able to cope. The sales pitch of rust, and really the reason it's the
> first non-C language in linux, is that with the borrow semantics, it can
> validate this stuff at compile time, and allow kernel drivers to continue
> playing these tricks to outright avoid any and all refcounting needs. If
> rust doesn't deliver and we need to have all the refcounting again to make
> rust drivers safe I think that'll cast serious doubts on the viability of
> rust-in-linux.

Right, so the thing is Rust can encode *some* borrowed reference
semantics at compile time, namely what Rust lifetimes can represent. But
you can't encode arbitrarily complex "I can guarantee this is alive
right now because reasons A,B,C,D" semantics in the type system. For
arbitrarily complex rules like that, you need to either refcount, or use
unsafe and raw pointers. For example, in my driver I use raw pointers in
the event slot acquisition to point to the actual u32s that hold the
event values, because I can guarantee through the semantics of that
module that those will always be valid pointers to the live array (and
never alias too), even though I can't encode that in the type system.
Throwing around refcounted pointers to the whole object there would be
silly, so instead I just used unsafe and left a SAFETY comment
explaining why it's safe.

In the end, unsafe isn't any worse than C (it's strictly better, you
still get the borrow checker and type safety and everything else), so I
don't think Rust not being able to solve all of these problems is really
a good argument against Rust. The idea, and where Rust really shines, is
that even when you have to do this you are *containing* the unsafe code
within places where it can be audited and explained. And when the
majority of the code is safe, and when you strive to make the interfaces
between modules safe, and when you limit the ability of cross-module
interactions to break safety (Not eliminate! In abstractions yes, and in
userspace Rust this is expected, but within a driver it gets more
interesting and I can spend a long time talking about how safety
boundaries in drivers are complicated and a bit porous...) you end up
reducing the chances of safety bugs by orders of magnitude anyway.

And it worked, right? I have ~128 unsafe{} blocks in the driver and
still no oopses in production ^^

But (big but!) the idea with safe Rust abstractions (the stuff in rust/)
is that they absolutely do need to be safe (unless marked unsafe, but
that sucks...), because then implicitly you are documenting the
safety/lifetime requirements in the type system and that's hugely
beneficial for common code. It means users of those APIs don't have to
think about the internals and safety invariants and all that. Drivers
still have to worry about internal safety, but they don't have to worry
about messing up the interface with common code. That's also very
beneficial for allowing refactoring of that common code without worrying
that you'll break users.

So technically we could just give up and mark the drm_sched abstraction
unsafe and actually document all of these safety requirements (which is
the other side of the coin - if you want to use unsafe, you get to write
the docs, lawyer style! That's the rule! No getting by with vague docs
with dozens of unexplained corner cases!). But, honestly, I think that
would be kind of sad... drm_sched really should have a self-contained,
safe API, and I just don't see a good reason why it can't other than it
not having been designed with this in mind initially. It was ported out
of a driver, and not that long ago, right? That explains a lot, because
it means it is probably leaking all of these assumptions from that
driver's internal design... and it probably doesn't help that the people
maintaining it all seem to be from that particular company either. It's
easy to have tunnel vision if you mostly work on one use case...

> Now there's definitely going to be hilarious bugs uncovered on the C side,
> and semantics that need to be clarified, but I think enabling scary tricks
> like the above one is what'll fundamentally make or break rust in linux.

For the specific case of crash dumps and stop-the-world stuff like
that... yeah, those are some of the worst things to fit in otherwise
nice Rust designs because they violate layering left and right. I don't
have a good answer for how exactly I'd do this in my driver yet. I'm
sure it *can* be done, and there's always unsafe if needed, since at
that point you're dealing in driver code and that's a lot less
contentious than an abstraction being unsafe itself...

My initial plan for FW crash dumps is going to be fairly easy because I
only really care about dumping the firmware object arena and maybe
actual firmware .data, and that's all GPU-global anyway, and it won't be
hard to poke my way into the global allocators to take a snapshot of
their backing GEM objects. There's a global alloc lock involved there
which is all I need for basic consistency and getting the locking right
for allocs as far as deadlock risk is something I need to care about for
other reasons anyway. It won't guarantee that nobody else is *writing*
to any of that memory while I take a snapshot, but I don't really care
about that since it would be rare and should be limited to
newly-allocated memory the firmware doesn't yet care about anyway.

In the end, nothing stops you from having piles of raw pointer
backdoors, having a rule that "this is only used for the crash handler",
and then having a couple pages of SAFETY comments in that bit of the
code with a big "here be dragons" disclaimer. It won't make that
particular chunk of code any less likely to have bugs than the
equivalent C code, but at least the rest of the driver isn't affected,
and hopefully you can at least avoid unsafe in some subsets of the
troublesome part too ^^

There's also a flipside to this though: Who cares if you take extra
redundant GEM BO references in the crash handler due to API safety?
After all, it's not a perf-sensitive codepath. Of course, if safety
causes you to add refcounting at all where none would otherwise be
needed, or to have to do more of it in hot paths, or to introduce extra
locking that could cause deadlocks that's a different story. But I think
the thought process is very different between Rust and C here, when
writing drivers. In C, it takes extra work to be safe, so chances are
most people won't write defensive code like that even where there is no
performance reason not to, since it's outright easier to read code the
less safe it is. Rust isn't like that! Most safety is encoded in types
(automatic reference counting is a big one here), so it doesn't take any
extra effort to be safe, while it usually does to do it unsafely (unsafe
{} blocks if nothing else). So then you write safe code by default...
which yes, might do some extra refcounting and locking and stuff, but
you can always optimize that later, profiler in hand. Once you know you
need unsafe for real perf reasons, by all means, and I expect to end up
doing a round of profiling in the future and have to poke some holes in
things too. But until then... safe is just easier and, well, safer!

> In the end it'll be lots of detail work, and I really hope it all works
> out.
>
>> I probably got more than one thing wrong there, and there's layers of
>> complexity I glossed over, but that's the rough idea ^^
>>
>> * If we need to fix this then we're going to need some kind of signal from
>> the DRM core that this is happening and it's not normal user-triggered GEM
>> closing, and it's going to be interesting... it also means we need some kind
>> of mechanism to transfer responsibility over those mappings to all in-flight
>> jobs themselves, because normally userspace is strictly responsible for all
>> mappings in an explicit sync VM bind style world, and now we're adding a
>> special case where we freeze the state of the VM until all in-flight jobs
>> are done when the File goes away instead of eagerly freeing things. That's a
>> very weird departure from how the whole thing normally works, so if we
>> really want that I'm going to have to think of how to do it reasonably. It
>> might be easier once we implement the full VM map range tracking which will
>> likely flip the VM<->GEM object relationship around a bit.
>
> See my other reply for you how can untangle this potentially on the vm_id
> subthread. I think in general rust Drop is awesome, but there's cases
> where we'll have to be more explicit at unwinding things. At least in the
> linux kernel the general consensus is that too much refcounting is a
> maintenance disaster because you end up dropping the final reference in
> all kinds of really bad places (interrupt handlers, while holding the
> wrong locks, holding them too long).

Yeah, that's one I've run into, it's why I have a few places with subtle
code to avoid that and more things will need cleaning up too...

This is something that I expect will become Rust compile-time magic in
the future though! There's already been talk about being able to prove
that things are only ever called from the right execution context. Then
if you have a Drop of a reference whose underlying type is not safe to
Drop from IRQ context in IRQ context, the compiler complains. It's not
clear exactly what this will look like yet, but it's an active topic of
discussion. And that's another nice thing about Rust: we can actually
help drive the evolution of the language, much faster than with C!

> Somewhat related I also expect that
> rust drivers will need to have quite a few manual drop calls to
> artificially limit lifetime, because sometimes when you drop things the
> wrong way round (e.g. drop the refcount while still having the mutex guard
> live) results in deadlocks.

Yup! ^^

drivers/gpu/drm/asahi$ grep -r core::mem::drop | wc -l
17

Rust is magic but it won't magically solve all of our problems!

(I don't think all of those 17 are necessary, some are just places where
I wanted to be explicit/safer or bound critical sections more, but a few
are definitely there to avoid deadlocks.)

Simple Arc<Mutex<T>> usage isn't a problem though, you can't drop the
refcount without dropping the guard first there (lifetimes prevent it).
The tricky bit is when you have to call into *another* agent that owns a
reference to the same object, after having done something while holding
the mutex. Then you really need to drop the guard first.

> There's going to be lots of fun here :-)

It's already been fun and I don't expect it to get any less fun ^^

~~ Lina

2023-04-06 15:30:40

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am Donnerstag, dem 06.04.2023 um 16:21 +0200 schrieb Christian König:
> Am 06.04.23 um 12:45 schrieb Lucas Stach:
> > Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > > On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
> > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > Hi Danilo,
> > > > > > > >
> > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > >
> > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > >
> > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > a UAF.
> > > > > > > > >
> > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > >
> > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > >
> > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > a fix directly.
> > > > > > > > >
> > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > >
> > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > >
> > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > requires execution time tracking on the entities.
> > > > > > > Danilo,
> > > > > > >
> > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > I think the only real motivation for doing that would be for generically
> > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > patch for this option and drop this one:
> > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > it--when the last user drops the kref, the release is called to do
> > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > example rarely dropping the last reference from interrupt context or
> > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > that is a good recipe for problems as well.
> > > >
> > Fully agreed.
> >
> > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > but I'll comment on that thread as well for the record.
> > > > Reference counting the entities was suggested before. The intentionally
> > > > avoided that so far because the entity might be the tip of the iceberg
> > > > of stuff you need to keep around.
> > > >
> > > > For example for command submission you also need the VM and when you
> > > > keep the VM alive you also need to keep the file private alive....
> > > Yeah refcounting looks often like the easy way out to avoid
> > > use-after-free issue, until you realize you've just made lifetimes
> > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > keeps all the bo alives, somehow every crash wastes more memory
> > > because vk_device_lost means userspace allocates new stuff for
> > > everything.
> > >
> > > If possible a lifetime design where lifetimes have hard bounds and you
> > > just borrow a reference under a lock (or some other ownership rule) is
> > > generally much more solid. But also much harder to design correctly
> > > :-/
> > >
> > The use we are discussing here is to keep the entity alive as long as
> > jobs from that entity are still active on the HW. While there are no
> > hard bounds on when a job will get inactive, at least it's not
> > unbounded. On a crash/fault the job will be removed from the hardware
> > pretty soon.
> >
> > Well behaved jobs after a application shutdown might take a little
> > longer, but I don't really see the new problem with keeping the entity
> > alive? As long as a job is active on the hardware, we can't throw out
> > the VM or BOs, no difference whether the entity is kept alive or not.
>
> Exactly that's the problem. VM & BOs are dropped as soon as the process
> is destroyed, we *don't* wait for the hw to finish before doing so.
>
> Just the backing store managed by all the house keeping objects isn't
> freed until the hw is idle preventing a crash or accessing freed memory.
>
> This behavior is rather important for the OOM killer since we need to be
> able to tear down the process as fast as possible in that case.
>
Are you talking about the dropping of pending jobs in
drm_sched_entity_kill? I'm certainly not trying to change that in any
way. Those aren't put onto the hardware yet, so we can always safely
drop them and do so as fast as possible.

What I'm concerned about are the jobs that are already scheduled on the
HW. At least with Vivante hardware there is no race free way to get rid
of jobs once they are put on the ring. So whatever the scheduler or DRM
core is doing, we have to hold on to the BOs and GPU memory management
structures to keep the hardware from operating on freed memory.

That's already a lot of memory, so I don't really see the issue with
keeping the entity around in a quiescent state until all the currently
queued jobs have left the HW.

> Changing that is possible, but that's quite a huge change I'm not really
> willing to do just for tracking the time spend.
>
Yea, it's a big change and whether it's a good idea really depends on
what we a gaining from it. You seem to see quite low value in "just
tracking the time spent" and that might be true, but it also forces all
drivers that want to implement fdinfo to roll their own time tracking.
I would rather see more of this moved to the scheduler and thus shared
between drivers.

Regards,
Lucas

> What we could do is to track the unsignaled fences in each entity
> similar to what amdgpu is doing.
>
> Regards,
> Christian.
>
> >
> > Some hardware might have ways to expedite job inactivation by
> > deactivating scheduling queues, or just taking a fault, but for some HW
> > we'll just have to wait for the job to finish.
> >
> > Regards,
> > Lucas
> >
>

2023-04-06 15:33:25

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 11:43:19PM +0900, Asahi Lina wrote:
> On 06/04/2023 22.37, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 09:21:47PM +0900, Asahi Lina wrote:
> > > On 06/04/2023 19.09, Daniel Vetter wrote:
> > > > On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
> > > > > On 06/04/2023 17.27, Daniel Vetter wrote:
> > > > > > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > > > > > >
> > > > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > > > Hi Danilo,
> > > > > > > > > > >
> > > > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > > > >
> > > > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > > > >
> > > > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > > > a UAF.
> > > > > > > > > > > >
> > > > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > > > >
> > > > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > > > >
> > > > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > > > a fix directly.
> > > > > > > > > > > >
> > > > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > > > >
> > > > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > > > Danilo,
> > > > > > > > > >
> > > > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > > > patch for this option and drop this one:
> > > > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > > > >
> > > > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > > > example rarely dropping the last reference from interrupt context or
> > > > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > > > that is a good recipe for problems as well.
> > > > > > >
> > > > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > > > but I'll comment on that thread as well for the record.
> > > > > > >
> > > > > > > Reference counting the entities was suggested before. The intentionally
> > > > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > > > of stuff you need to keep around.
> > > > > > >
> > > > > > > For example for command submission you also need the VM and when you
> > > > > > > keep the VM alive you also need to keep the file private alive....
> > > > > >
> > > > > > Yeah refcounting looks often like the easy way out to avoid
> > > > > > use-after-free issue, until you realize you've just made lifetimes
> > > > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > > > because vk_device_lost means userspace allocates new stuff for
> > > > > > everything.
> > > > >
> > > > > Refcounting everywhere has been working well for us, so well that so far all
> > > > > the oopses we've hit have been... drm_sched bugs like this one, not anything
> > > > > in the driver. But at least in Rust you have the advantage that you can't
> > > > > just forget a decref in a rarely-hit error path (or worse, forget an incref
> > > > > somewhere important)... ^^
> > > > >
> > > > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > > > generally much more solid. But also much harder to design correctly
> > > > > > :-/
> > > > > >
> > > > > > > Additional to that we have some ugly inter dependencies between tearing
> > > > > > > down an application (potential with a KILL signal from the OOM killer)
> > > > > > > and backward compatibility for some applications which render something
> > > > > > > and quit before the rendering is completed in the hardware.
> > > > > >
> > > > > > Yeah I think that part would also be good to sort out once&for all in
> > > > > > drm/sched, because i915 has/had the same struggle.
> > > > > > -Daniel
> > > > > >
> > > > >
> > > > > Is this really a thing? I think that's never going to work well for explicit
> > > > > sync, since the kernel doesn't even know what BOs it has to keep alive for a
> > > > > job... I guess it could keep the entire file and all of its objects/VMs/etc
> > > > > alive until all of its submissions complete but... ewww.
> > > > >
> > > > > Our Mesa implementation synchronously waits for all jobs on context destroy
> > > > > for this reason, but if you just kill the app, yeah, you get faults as
> > > > > running GPU jobs have BOs yanked out from under them. Kill loops make for a
> > > > > good way of testing fault handling...
> > > >
> > > > You wind down the entire thing on file close? Like
> > > > - stop all context
> > > > - tear down all context
> > > > - tear down all vm
> > > > - tear down all obj
> > > >
> > > > Just winding things down in a random order and then letting gpu fault
> > > > handling sort out the mess doesn't strike me as particularly clean design
> > > > ...
> > > The idea is that object drop order generally doesn't matter since things
> > > that care about other things should own them or hold references to them
> > > anyway, so the dependency graph of all the resources is encoded directly in
> > > the type hierarchy instead of having to open-code a "cleanup procedure"...
> > > which then invariably leads to corner cases when you have to do the same
> > > thing, or part of it, for error handling.
> > >
> > > This has been working *very* well! It solves the issue of error handling
> > > since error handling just unwinds whatever was done to that point naturally
> > > in Rust (? operator), so there's practically no open-coded error handling
> > > code anywhere. The first time we ran into OOMs (Xonotic with no Mesa texture
> > > compression support yet, on 8GB machines on max settings...) the whole thing
> > > just worked. OOM killer went rampant and shmem doesn't account stuff to
> > > processes properly of course, but all the error paths, allocation errors,
> > > etc... all of that just worked, first try, across dozens of error paths that
> > > had never been tested before, not a single oops or deadlock or leak or
> > > anything in sight. Similarly, yesterday I did manage to run into drm_sched
> > > failing to create kthreads (the scaling issue Matthew's series fixes)... and
> > > still, that was fine. That happens on queue creation so it just bubbled up
> > > to Mesa as a failed ioctl and things kept moving along nicely otherwise. I
> > > even have nice ergonomic XArray semantics so that you can reserve a new
> > > slot, allocate some object, then populate it, and if you don't (because you
> > > errored out in between) it automatically gets freed again without explicit
> > > cleanup code.
> > >
> > > And it also means that I can encode *firmware* resource dependencies in the
> > > type system (with Rust lifetimes attached to *GPU* pointers even - it's a
> > > bit dodgy how it's done but it works well in practice). Since it is
> > > absolutely critical that the firmware objects respect their lifetimes or
> > > else the whole thing crashes irrecoverably, this is the only way I feel it's
> > > been even practical to write this driver and not be a firmware crash mess.
> > > Of course we still get some crashes due to flaws in how I understand the
> > > firmware, but it's always things I don't know, not things I accidentally
> > > messed up in some corner case code path we don't normally hit, since I just
> > > don't have to think about that as long as the hierarchy is right.
> > >
> > > I actually don't know exactly what precise order things get dropped in for
> > > this reason! I could find out, and it's predictable in Rust, what I mean is
> > > that thinking about a total order like that is not necessary for correctness
> > > as long as I got the ownership right. Off the top of my head though, it goes
> > > very roughly like this:
> > >
> > > - On File close, all the GEM objects get closed (DRM core does this)
> > > - This triggers explicit drops of all mappings in those GEM objects owned by
> > > that File (identified by unique ID, this is the one annoying circular
> > > reference thing I mentioned in the other thread...). At this point the GPU
> > > probably faults but we don't care. *
> > > - The File itself gets dropped, which drops the XArrays for queues and
> > > (UAPI) VMs
> > > - UAPI VMs getting dropped doesn't do much other than unmap a single dummy
> > > object. The underlying MMU VM is refcounted and jobs hold references. This
> > > also drops the userspace VM object allocator used for kernel-managed
> > > allocations, but that too is internally refcounted and won't go away until
> > > all of its allocations do.
> > > - Queues get dropped, which mostly drops a bunch of references to things
> > > that no longer matter, along with the scheduler and scheduler entity.
> > > - The entity already has a reference to the scheduler in the abstraction (to
> > > meet the soundness requirement), so the entity necessarily goes first. That
> > > kills all not yet scheduled jobs, freeing any resources they might use.
> > > - Then the scheduler gets torn down, and with my other patch that logically
> > > kills all in-flight jobs, detaching their hardware fences and dropping the
> > > job objects. This... still doesn't do much other than drop some references
> > > that we don't care about.
> > > - At this point, if any jobs are in flight, their firmware objects and all
> > > of the type hierarchy that goes with them is still alive, as well as the
> > > firmware queues they're in and the Rust objects representing them, the VMs
> > > they use, the Events they have been allocated...
> > > - Jobs complete (successfully or fault), then when complete get popped off
> > > of the Rust-side queue objects that represent the firmware queues.
> > > - When any given FW queue is empty, it relinquishes its assigned firmware
> > > event ID. That causes the event system to drop its owner reference to it,
> > > which means the queue itself gets dropped (since the UAPI Queue that also
> > > held a reference is gone). That then also drops a reference to what I call
> > > the GpuContext.
> > > - To avoid deadlocks, completed job objects are freed in another thread
> > > (ugly hack right now, should be done better in the future). Eventually as
> > > that happens, any resources they reference are dropped, including some
> > > shared ones that are logically tied to the scheduler/queues, references to
> > > the MMU VM address space, references to the VM slot that address space is
> > > assigned to, objects allocated out of user VM space, everything. Refcounts
> > > everywhere for anything shared, direct ownership of child structures for
> > > anything that isn't (work buffers, firmware command lists, etc.). I once
> > > tried to make a slide of the references and pointers involved in just the
> > > vertex half of a single GPU job and... even just that was quite interesting.
> > > - When the last job completes, that drops the last reference to the VM slot,
> > > which means the backing VM is logically detached from the GPU MMU (this is
> > > lazy though, it's still there in practice).
> > > - When the last firmware queue is dropped for a logical queue/sched/etc
> > > (which means no more jobs are running at the GPU for that context), that
> > > drops the last reference to the GpuContext. That also gets shoved into
> > > another thread context for cleanup to avoid deadlocks with fault recovery.
> > > - When that is finally cleaned up, a firmware command is sent to invalidate
> > > the GpuContext. I'm still figuring out what that does and what the lifetime
> > > rules are here (this is the only explicit invalidation command that exists),
> > > but as of yesterday I know that at the very least we need to keep hold of
> > > any Tiled Vertex Buffer associated with it until after inval, so that now
> > > has a reference to it that gets dropped after the firmware acknowledges the
> > > invalidation (unless it's a non-render-capable Queue, then no TVB
> > > necessary).
> > > - When the Buffer gets dropped, that frees both its backing memory and
> > > (virtual) page list structures, which are in user VM space, as well as some
> > > kernel firmware objects.
> > > - If things have happened in the order I listed here, those will be the last
> > > allocations in the two user VM space heap object allocators, so those now
> > > get dropped, which drops the mappings of their backing GEM objects,
> > > unmapping them from the MMU VM page tables.
> > > - Those mappings will now be the last references to the actual MMU VM
> > > object, so that it gets destroyed (the lazy detach comes into effect here,
> > > PT base address is removed from the VM context base table, full ASID
> > > invalidate, etc.), which with it drops the IoPgTable that backs it, which
> > > frees the page tables.
> > > - Then finally the GEM objects backing the userspace allocators get dropped
> > > as well, which will be the last reference to them, so those get freed.
> >
> > Thanks for the write up!
> >
> > Maybe some more fundamental thoughts on this entire endeavour:
> >
> > I think one thing that's causing a lot of friction that in C drivers at
> > least some of these things are done with implied/borrowed references. If
> > you want an absolute shocking example, the gpu crash dump sampling for a
> > driver that does dynamic memory management can be pretty epic:
> >
> > - the timeout handler is guaranteed (well if the driver didn't screw up
> > things, which mostly boils down to call drm_sched_stop() before it
> > analyzes anything at all wrt gpu hw state) to hold up the drm_job
> > completion fence
> >
> > - this fence is (at submit ioctl time) installed into all the dma_resv
> > fence containers for the vm (single shared dma_resv for all vm-private obj for
> > efficiency, all shareable gem_bo have their own dma_resv)
> >
> > - the memory manager guarantees that it will not free (or even move in the
> > case of ttm) the buffer while a fence is unsingalled. ttm does this by
> > delaying the free as needed, i915-gem and some of the others did this by
> > trying to hold onto extra references. the latter causes way too much
> > problems with reference dropping in my experience looking at a decade of
> > drivers getting details wrong here.
> >
> > - which means the crash dump handler can have a plain list of gem_bo to
> > record, without any references, because they wont go away before it
> > finishes.
> >
> > - more fun even, you could directly sample the memory locations at ioctl
> > submit time, and even when ttm moves the buffers around in a pipelined
> > fashion: your timeout handler wont run before all the jobs to move the
> > buffer into the right location have completed, and it will not unblock
> > any subsequent move or eviction that's already queued up. Which means
> > even the raw memory address will point at the right stuff.
> >
> > This is just one example. drm and the kernel is full of these borrowed
> > reference tricks with often cross through the entire stack, so rust has to
> > be able to cope. The sales pitch of rust, and really the reason it's the
> > first non-C language in linux, is that with the borrow semantics, it can
> > validate this stuff at compile time, and allow kernel drivers to continue
> > playing these tricks to outright avoid any and all refcounting needs. If
> > rust doesn't deliver and we need to have all the refcounting again to make
> > rust drivers safe I think that'll cast serious doubts on the viability of
> > rust-in-linux.
>
> Right, so the thing is Rust can encode *some* borrowed reference semantics
> at compile time, namely what Rust lifetimes can represent. But you can't
> encode arbitrarily complex "I can guarantee this is alive right now because
> reasons A,B,C,D" semantics in the type system. For arbitrarily complex rules
> like that, you need to either refcount, or use unsafe and raw pointers. For
> example, in my driver I use raw pointers in the event slot acquisition to
> point to the actual u32s that hold the event values, because I can guarantee
> through the semantics of that module that those will always be valid
> pointers to the live array (and never alias too), even though I can't encode
> that in the type system. Throwing around refcounted pointers to the whole
> object there would be silly, so instead I just used unsafe and left a SAFETY
> comment explaining why it's safe.
>
> In the end, unsafe isn't any worse than C (it's strictly better, you still
> get the borrow checker and type safety and everything else), so I don't
> think Rust not being able to solve all of these problems is really a good
> argument against Rust. The idea, and where Rust really shines, is that even
> when you have to do this you are *containing* the unsafe code within places
> where it can be audited and explained. And when the majority of the code is
> safe, and when you strive to make the interfaces between modules safe, and
> when you limit the ability of cross-module interactions to break safety (Not
> eliminate! In abstractions yes, and in userspace Rust this is expected, but
> within a driver it gets more interesting and I can spend a long time talking
> about how safety boundaries in drivers are complicated and a bit porous...)
> you end up reducing the chances of safety bugs by orders of magnitude
> anyway.
>
> And it worked, right? I have ~128 unsafe{} blocks in the driver and still no
> oopses in production ^^
>
> But (big but!) the idea with safe Rust abstractions (the stuff in rust/) is
> that they absolutely do need to be safe (unless marked unsafe, but that
> sucks...), because then implicitly you are documenting the safety/lifetime
> requirements in the type system and that's hugely beneficial for common
> code. It means users of those APIs don't have to think about the internals
> and safety invariants and all that. Drivers still have to worry about
> internal safety, but they don't have to worry about messing up the interface
> with common code. That's also very beneficial for allowing refactoring of
> that common code without worrying that you'll break users.
>
> So technically we could just give up and mark the drm_sched abstraction
> unsafe and actually document all of these safety requirements (which is the
> other side of the coin - if you want to use unsafe, you get to write the
> docs, lawyer style! That's the rule! No getting by with vague docs with
> dozens of unexplained corner cases!). But, honestly, I think that would be
> kind of sad... drm_sched really should have a self-contained, safe API, and
> I just don't see a good reason why it can't other than it not having been
> designed with this in mind initially. It was ported out of a driver, and not
> that long ago, right? That explains a lot, because it means it is probably
> leaking all of these assumptions from that driver's internal design... and
> it probably doesn't help that the people maintaining it all seem to be from
> that particular company either. It's easy to have tunnel vision if you
> mostly work on one use case...

Nah I don't want to give up. What I expect to happen is that rust
abstractions will need to be a lot more opionated about how to do certain
things. And then the abstractions together with the C code provide the
guarantee that the rust driver is safe.

> > Now there's definitely going to be hilarious bugs uncovered on the C side,
> > and semantics that need to be clarified, but I think enabling scary tricks
> > like the above one is what'll fundamentally make or break rust in linux.
>
> For the specific case of crash dumps and stop-the-world stuff like that...
> yeah, those are some of the worst things to fit in otherwise nice Rust
> designs because they violate layering left and right. I don't have a good
> answer for how exactly I'd do this in my driver yet. I'm sure it *can* be
> done, and there's always unsafe if needed, since at that point you're
> dealing in driver code and that's a lot less contentious than an abstraction
> being unsafe itself...
>
> My initial plan for FW crash dumps is going to be fairly easy because I only
> really care about dumping the firmware object arena and maybe actual
> firmware .data, and that's all GPU-global anyway, and it won't be hard to
> poke my way into the global allocators to take a snapshot of their backing
> GEM objects. There's a global alloc lock involved there which is all I need
> for basic consistency and getting the locking right for allocs as far as
> deadlock risk is something I need to care about for other reasons anyway. It
> won't guarantee that nobody else is *writing* to any of that memory while I
> take a snapshot, but I don't really care about that since it would be rare
> and should be limited to newly-allocated memory the firmware doesn't yet
> care about anyway.

So this is were crash dump capturing gets fun. If that lock covers any
memory allocations, you can't take it in the tdr path.

The simplest way out is to make context crashes permanent and bail out the
entire crash capturing to a worker (which needs gem bo references) that
doesn't block anything (so not even guilty/victim reporting to userspace)
and can take locks and capture things at leasure.

Only works when userspace obeys the promise of not overwriting the memory
meanwhile, which is what you get for vk_device_lost.

> In the end, nothing stops you from having piles of raw pointer backdoors,
> having a rule that "this is only used for the crash handler", and then
> having a couple pages of SAFETY comments in that bit of the code with a big
> "here be dragons" disclaimer. It won't make that particular chunk of code
> any less likely to have bugs than the equivalent C code, but at least the
> rest of the driver isn't affected, and hopefully you can at least avoid
> unsafe in some subsets of the troublesome part too ^^
>
> There's also a flipside to this though: Who cares if you take extra
> redundant GEM BO references in the crash handler due to API safety? After
> all, it's not a perf-sensitive codepath. Of course, if safety causes you to
> add refcounting at all where none would otherwise be needed, or to have to
> do more of it in hot paths, or to introduce extra locking that could cause
> deadlocks that's a different story. But I think the thought process is very
> different between Rust and C here, when writing drivers. In C, it takes
> extra work to be safe, so chances are most people won't write defensive code
> like that even where there is no performance reason not to, since it's
> outright easier to read code the less safe it is. Rust isn't like that! Most
> safety is encoded in types (automatic reference counting is a big one here),
> so it doesn't take any extra effort to be safe, while it usually does to do
> it unsafely (unsafe {} blocks if nothing else). So then you write safe code
> by default... which yes, might do some extra refcounting and locking and
> stuff, but you can always optimize that later, profiler in hand. Once you
> know you need unsafe for real perf reasons, by all means, and I expect to
> end up doing a round of profiling in the future and have to poke some holes
> in things too. But until then... safe is just easier and, well, safer!

The problem with extra refcounting is dropping them. There's endless
amounts of fun where dropping a kref at the wrong time goes boom (wrong
context, holding wrong locks). Or then you delay it and userspace gets
annoyed (task_work for struct file is fun), so just unconditionally
stuffing everything you free into a worker isn't always the right thing
either.

And rust makes this extremely easy, all the Drop stuff isn't visible
anywhere in the code, and you'll never screw up the error paths. So my
take is a bit that unless you've excluded a borrow based design as
impossible already, sprinkling Arc<T> all over isn't really solid either.

> > In the end it'll be lots of detail work, and I really hope it all works
> > out.
> >
> > > I probably got more than one thing wrong there, and there's layers of
> > > complexity I glossed over, but that's the rough idea ^^
> > >
> > > * If we need to fix this then we're going to need some kind of signal from
> > > the DRM core that this is happening and it's not normal user-triggered GEM
> > > closing, and it's going to be interesting... it also means we need some kind
> > > of mechanism to transfer responsibility over those mappings to all in-flight
> > > jobs themselves, because normally userspace is strictly responsible for all
> > > mappings in an explicit sync VM bind style world, and now we're adding a
> > > special case where we freeze the state of the VM until all in-flight jobs
> > > are done when the File goes away instead of eagerly freeing things. That's a
> > > very weird departure from how the whole thing normally works, so if we
> > > really want that I'm going to have to think of how to do it reasonably. It
> > > might be easier once we implement the full VM map range tracking which will
> > > likely flip the VM<->GEM object relationship around a bit.
> >
> > See my other reply for you how can untangle this potentially on the vm_id
> > subthread. I think in general rust Drop is awesome, but there's cases
> > where we'll have to be more explicit at unwinding things. At least in the
> > linux kernel the general consensus is that too much refcounting is a
> > maintenance disaster because you end up dropping the final reference in
> > all kinds of really bad places (interrupt handlers, while holding the
> > wrong locks, holding them too long).
>
> Yeah, that's one I've run into, it's why I have a few places with subtle
> code to avoid that and more things will need cleaning up too...
>
> This is something that I expect will become Rust compile-time magic in the
> future though! There's already been talk about being able to prove that
> things are only ever called from the right execution context. Then if you
> have a Drop of a reference whose underlying type is not safe to Drop from
> IRQ context in IRQ context, the compiler complains. It's not clear exactly
> what this will look like yet, but it's an active topic of discussion. And
> that's another nice thing about Rust: we can actually help drive the
> evolution of the language, much faster than with C!
>
> > Somewhat related I also expect that
> > rust drivers will need to have quite a few manual drop calls to
> > artificially limit lifetime, because sometimes when you drop things the
> > wrong way round (e.g. drop the refcount while still having the mutex guard
> > live) results in deadlocks.
>
> Yup! ^^
>
> drivers/gpu/drm/asahi$ grep -r core::mem::drop | wc -l
> 17
>
> Rust is magic but it won't magically solve all of our problems!
>
> (I don't think all of those 17 are necessary, some are just places where I
> wanted to be explicit/safer or bound critical sections more, but a few are
> definitely there to avoid deadlocks.)
>
> Simple Arc<Mutex<T>> usage isn't a problem though, you can't drop the
> refcount without dropping the guard first there (lifetimes prevent it). The
> tricky bit is when you have to call into *another* agent that owns a
> reference to the same object, after having done something while holding the
> mutex. Then you really need to drop the guard first.

Yeah it's the cross object/module/crate fun for refcount dropping.

The trouble is that refcount bugs are substantially worse to validate at
runtime than locking. Locking isn't optional, so generally all you need to
do is enable lockdep and run through all the codepaths and you're good.
Sometimes that's a bit too tough and you can insert fake lockdep contexts
to tie things together like fs_reclaim or dma_fence_signalling, but
by&large it's not that hard to validate at runtime to a reasonable
confidence.

Reference dropping otoh isn't, because usually userspace holds onto a
reference and drops the final reference through a syscall in a very well
defined place. Now you could just add a fake lockdep context to pretend
that any kref_put could be the final one, but generally this results in a
lot of false positives (because of borrowed or implied references
protecting you from the final reference drop). Which means unless you've
gone through the trouble of proving that your final kref_put/rust Arc
drops are never in a bad place (which means pretty much uncovering the
entire Drop hierarchy mentally in every place) you have some potentially
really nasty problems at hand.

And unlike rust in C all the _put() calls are at least explicitly staring
at you and reminding you that you're piling up a mountain that might come
crashing down in an avalanche :-) Ofc you're also all but guaranteed to
have screwed up an error path, but that's at least a runtime validation
solveable problem nowadays with kasan + code coverage checks (not that
we're any good at it tbh, syzkaller has clearly opinions to the contrary
here).

> > There's going to be lots of fun here :-)
>
> It's already been fun and I don't expect it to get any less fun ^^

Oh that's for sure.

>
> ~~ Lina
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 15:35:11

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am 06.04.23 um 17:24 schrieb Lucas Stach:
> Am Donnerstag, dem 06.04.2023 um 16:21 +0200 schrieb Christian König:
>> Am 06.04.23 um 12:45 schrieb Lucas Stach:
>>> Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
>>>> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>>>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>>>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>>>>> Hi Danilo,
>>>>>>>>>
>>>>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>>>>
>>>>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>>>>> once the job is fetched from the pending_list.
>>>>>>>>>>
>>>>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>>>>> a UAF.
>>>>>>>>>>
>>>>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>>>>
>>>>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>>>>
>>>>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>>>>> a fix directly.
>>>>>>>>>>
>>>>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>>>>
>>>>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>>>>
>>>>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>>>>> more fairness than the current one sometime in the future, which
>>>>>>>>> requires execution time tracking on the entities.
>>>>>>>> Danilo,
>>>>>>>>
>>>>>>>> Using kref is preferable, i.e. option 1 above.
>>>>>>> I think the only real motivation for doing that would be for generically
>>>>>>> tracking job statistics within the entity a job was deployed through. If
>>>>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>>>>> patch for this option and drop this one:
>>>>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>>>>> The reason kref is attractive is because one doesn't need to worry about
>>>>>> it--when the last user drops the kref, the release is called to do
>>>>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>>>> Yeah, reference counting unfortunately have some traps as well. For
>>>>> example rarely dropping the last reference from interrupt context or
>>>>> with some unexpected locks help when the cleanup function doesn't expect
>>>>> that is a good recipe for problems as well.
>>>>>
>>> Fully agreed.
>>>
>>>>>> Regarding the patch above--I did look around the code, and it seems safe,
>>>>>> as per your analysis, I didn't see any reference to entity after job submission,
>>>>>> but I'll comment on that thread as well for the record.
>>>>> Reference counting the entities was suggested before. The intentionally
>>>>> avoided that so far because the entity might be the tip of the iceberg
>>>>> of stuff you need to keep around.
>>>>>
>>>>> For example for command submission you also need the VM and when you
>>>>> keep the VM alive you also need to keep the file private alive....
>>>> Yeah refcounting looks often like the easy way out to avoid
>>>> use-after-free issue, until you realize you've just made lifetimes
>>>> unbounded and have some enourmous leaks: entity keeps vm alive, vm
>>>> keeps all the bo alives, somehow every crash wastes more memory
>>>> because vk_device_lost means userspace allocates new stuff for
>>>> everything.
>>>>
>>>> If possible a lifetime design where lifetimes have hard bounds and you
>>>> just borrow a reference under a lock (or some other ownership rule) is
>>>> generally much more solid. But also much harder to design correctly
>>>> :-/
>>>>
>>> The use we are discussing here is to keep the entity alive as long as
>>> jobs from that entity are still active on the HW. While there are no
>>> hard bounds on when a job will get inactive, at least it's not
>>> unbounded. On a crash/fault the job will be removed from the hardware
>>> pretty soon.
>>>
>>> Well behaved jobs after a application shutdown might take a little
>>> longer, but I don't really see the new problem with keeping the entity
>>> alive? As long as a job is active on the hardware, we can't throw out
>>> the VM or BOs, no difference whether the entity is kept alive or not.
>> Exactly that's the problem. VM & BOs are dropped as soon as the process
>> is destroyed, we *don't* wait for the hw to finish before doing so.
>>
>> Just the backing store managed by all the house keeping objects isn't
>> freed until the hw is idle preventing a crash or accessing freed memory.
>>
>> This behavior is rather important for the OOM killer since we need to be
>> able to tear down the process as fast as possible in that case.
>>
> Are you talking about the dropping of pending jobs in
> drm_sched_entity_kill? I'm certainly not trying to change that in any
> way. Those aren't put onto the hardware yet, so we can always safely
> drop them and do so as fast as possible.
>
> What I'm concerned about are the jobs that are already scheduled on the
> HW. At least with Vivante hardware there is no race free way to get rid
> of jobs once they are put on the ring. So whatever the scheduler or DRM
> core is doing, we have to hold on to the BOs and GPU memory management
> structures to keep the hardware from operating on freed memory.
>
> That's already a lot of memory, so I don't really see the issue with
> keeping the entity around in a quiescent state until all the currently
> queued jobs have left the HW.
>
>> Changing that is possible, but that's quite a huge change I'm not really
>> willing to do just for tracking the time spend.
>>
> Yea, it's a big change and whether it's a good idea really depends on
> what we a gaining from it. You seem to see quite low value in "just
> tracking the time spent" and that might be true, but it also forces all
> drivers that want to implement fdinfo to roll their own time tracking.
> I would rather see more of this moved to the scheduler and thus shared
> between drivers.

That's generally a good idea, but if that means that we need to
restructure the whole entity handling then I would object. That's simply
not worth it when we can implement it differently.

What we could do is to keep the submitted fences around in the entity.
Similar to the tracking amdgpu does, see struct amdgpu_ctx_entity.

This way the entity doesn't needs to stay around after it delivered the
job to the hw.

Regards,
Christian.

> Regards,
> Lucas
>
>> What we could do is to track the unsignaled fences in each entity
>> similar to what amdgpu is doing.
>>
>> Regards,
>> Christian.
>>
>>> Some hardware might have ways to expedite job inactivation by
>>> deactivating scheduling queues, or just taking a fault, but for some HW
>>> we'll just have to wait for the job to finish.
>>>
>>> Regards,
>>> Lucas
>>>

2023-04-06 15:59:55

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

Am Mittwoch, dem 05.04.2023 um 16:44 -0400 schrieb Luben Tuikov:
> On 2023-04-05 13:44, Lucas Stach wrote:
> > Hi Luben,
> >
> > Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
> > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > Hi Danilo,
> > > >
> > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > Hi all,
> > > > >
> > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > writing it to the entity through which the job was deployed to the
> > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > which fetches a job from the schedulers pending_list.
> > > > >
> > > > > Doing this can result in a race condition where the entity is already
> > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > once the job is fetched from the pending_list.
> > > > >
> > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > the structure that embeds the entity. However, a job originally handed
> > > > > over to the scheduler by this entity might still reside in the
> > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > already being called and the entity being freed. Hence, we can run into
> > > > > a UAF.
> > > > >
> > > > Sorry about that, I clearly didn't properly consider this case.
> > > >
> > > > > In my case it happened that a job, as explained above, was just picked
> > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > allocated for a subsequent client applications job structure again.
> > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > reproduce the same corruption over and over again by just using
> > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > >
> > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > a fix directly.
> > > > >
> > > > > Spontaneously, I see three options to fix it:
> > > > >
> > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > >
> > > > My vote is on this or something in similar vain for the long term. I
> > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > more fairness than the current one sometime in the future, which
> > > > requires execution time tracking on the entities.
> > >
> > > Danilo,
> > >
> > > Using kref is preferable, i.e. option 1 above.
> > >
> > > Lucas, can you shed some light on,
> > >
> > > 1. In what way the current FIFO scheduling is unfair, and
> > > 2. shed some details on this "scheduling algorithm with a bit
> > > more fairness than the current one"?
> >
> > I don't have a specific implementation in mind yet. However the current
> > FIFO algorithm can be very unfair if you have a sparse workload compete
> > with one that generates a lot of jobs without any throttling aside from
> > the entity queue length.
>
> Ah, that's interesting, let's see, a "sparse workload compete with one that
> generates a lot of jobs", so basically we have a sparse workload compete
> with a dense workload. So we can represent this with two entities, A, B,
> whose jobs we're going to represent by the entities, names A and B.
> So if we let A be the sparse workload and B the dense workload,
> we have this, wlog,
>
> First/oldest job, .........................., latter/new jobs.
> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, .....
>
> The current FIFO algorithm, would prefer to execute those jobs
> in order of submission, i.e. oldest-ready-first job. Assume
> that all jobs are ready. Then we'll execute them in order.
> This is desirable and fair. We want to execute the jobs
> in the order they were submitted, given also that they are
> ready to be executed. So perhaps we want to execute them like this:
>
> First/oldest job, .........................., latter/new jobs.
> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, ....
> Exec: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>
> Any other ordering would starve either A, or B. If we executed the 2nd A
> job at t6 or t7, then that would starve the 3rd/4th job in B, since the 2nd A job
> arrives at the same time as that of the 3rd B job, at time t6.
> The time t3-t0 is some delta > 0, some initial scheduler-start up time.
>
For simplicity now also assume that all jobs from A take 5ms of GPU
time, while jobs from B take 50ms of GPU time.

> IOW, we don't want to delay a job any more than it should wait--the oldest
> job, which is also ready, should execute next, so that we're fair how
> it executes in real time. We cannot boot B at t6, so that we execute A,
> just because it is sparse, but just arrived.
>
> From A's point of view, it shouldn't expect its job execution time distribution
> to be any different than its submission time distribution.
>
> Do you think there's a job permutation which offers a fairer scheduling
> than the Exec line above for the Submission line above?
>
Yes, if we try to keep some fairness of actual GPU time made available
to each entity by tracking the time spent over a sampling interval, we
would see that at t6 B has already spent 100ms of GPU time, while A has
only spent 5ms, so naturally we would prefer the newly submitted job
from entity A when choosing the next job to schedule.

> > By tracking the actual GPU time consumed by
> > the entities we could implement something with a bit more fairness like
> > deficit round robin (don't pin me on the specific algorithm, as I
> > haven't given it much thought yet).
>
> Since there's no preemption, this would be hard to achieve--you're at the mercy
> of the execution time of job A_i for an entity A job i. (Assuming there's no
> preemption as it is the current state of the GPU scheduler.)
>
> The only thing you can do, is punish the next job from this entity, A_i+1,
> to execute much later. However, you don't know how long A_i+1 would take. If A_i+1
> takes very little time, then you're better off executing it at the next opportune
> time, i.e. when it would normally execute. But such an algorithm, which doesn't
> know a priori the execution time of a job, would punish A_i+1 to execute much later.
>
> But if A_i+1 takes time as long or longer than A_i, then punishing it to execute much
> later, would simply delay it, from an observer's point a view, it wouldn't teach
> the context to submit smaller jobs, so that GPU sharing is more fair.

Without preemption we can't really make the scheduling absolutely fair,
but we can still do better than the simple FIFO. If a entity has taken
up a disproportionate amount of GPU time over the last sampling
interval, we can derate it when picking the next job to schedule to
allow other entities to take up more GPU time in the next sampling
interval.

Regards,
Lucas

2023-04-06 16:00:48

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 05:33:18PM +0200, Christian K?nig wrote:
> Am 06.04.23 um 17:24 schrieb Lucas Stach:
> > Am Donnerstag, dem 06.04.2023 um 16:21 +0200 schrieb Christian K?nig:
> > > Am 06.04.23 um 12:45 schrieb Lucas Stach:
> > > > Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > > > > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > > Hi Danilo,
> > > > > > > > > >
> > > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > > >
> > > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > > >
> > > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > > a UAF.
> > > > > > > > > > >
> > > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > > >
> > > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > > >
> > > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > > a fix directly.
> > > > > > > > > > >
> > > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > > >
> > > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > > >
> > > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > > Danilo,
> > > > > > > > >
> > > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > > patch for this option and drop this one:
> > > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > > example rarely dropping the last reference from interrupt context or
> > > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > > that is a good recipe for problems as well.
> > > > > >
> > > > Fully agreed.
> > > >
> > > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > > but I'll comment on that thread as well for the record.
> > > > > > Reference counting the entities was suggested before. The intentionally
> > > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > > of stuff you need to keep around.
> > > > > >
> > > > > > For example for command submission you also need the VM and when you
> > > > > > keep the VM alive you also need to keep the file private alive....
> > > > > Yeah refcounting looks often like the easy way out to avoid
> > > > > use-after-free issue, until you realize you've just made lifetimes
> > > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > > because vk_device_lost means userspace allocates new stuff for
> > > > > everything.
> > > > >
> > > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > > generally much more solid. But also much harder to design correctly
> > > > > :-/
> > > > >
> > > > The use we are discussing here is to keep the entity alive as long as
> > > > jobs from that entity are still active on the HW. While there are no
> > > > hard bounds on when a job will get inactive, at least it's not
> > > > unbounded. On a crash/fault the job will be removed from the hardware
> > > > pretty soon.
> > > >
> > > > Well behaved jobs after a application shutdown might take a little
> > > > longer, but I don't really see the new problem with keeping the entity
> > > > alive? As long as a job is active on the hardware, we can't throw out
> > > > the VM or BOs, no difference whether the entity is kept alive or not.
> > > Exactly that's the problem. VM & BOs are dropped as soon as the process
> > > is destroyed, we *don't* wait for the hw to finish before doing so.
> > >
> > > Just the backing store managed by all the house keeping objects isn't
> > > freed until the hw is idle preventing a crash or accessing freed memory.
> > >
> > > This behavior is rather important for the OOM killer since we need to be
> > > able to tear down the process as fast as possible in that case.
> > >
> > Are you talking about the dropping of pending jobs in
> > drm_sched_entity_kill? I'm certainly not trying to change that in any
> > way. Those aren't put onto the hardware yet, so we can always safely
> > drop them and do so as fast as possible.
> >
> > What I'm concerned about are the jobs that are already scheduled on the
> > HW. At least with Vivante hardware there is no race free way to get rid
> > of jobs once they are put on the ring. So whatever the scheduler or DRM
> > core is doing, we have to hold on to the BOs and GPU memory management
> > structures to keep the hardware from operating on freed memory.
> >
> > That's already a lot of memory, so I don't really see the issue with
> > keeping the entity around in a quiescent state until all the currently
> > queued jobs have left the HW.
> >
> > > Changing that is possible, but that's quite a huge change I'm not really
> > > willing to do just for tracking the time spend.
> > >
> > Yea, it's a big change and whether it's a good idea really depends on
> > what we a gaining from it. You seem to see quite low value in "just
> > tracking the time spent" and that might be true, but it also forces all
> > drivers that want to implement fdinfo to roll their own time tracking.
> > I would rather see more of this moved to the scheduler and thus shared
> > between drivers.
>
> That's generally a good idea, but if that means that we need to restructure
> the whole entity handling then I would object. That's simply not worth it
> when we can implement it differently.
>
> What we could do is to keep the submitted fences around in the entity.
> Similar to the tracking amdgpu does, see struct amdgpu_ctx_entity.
>
> This way the entity doesn't needs to stay around after it delivered the job
> to the hw.

I've done a _very_ cursory look, but from that the design seems to be that
we only keep something very small around to avoid oopsing (just
drm_sched_entity) and the overall gpu ctx goes away synchronously (more or
less) when userspace destroys it.

And then the actual fdinfo reporting or any reporting is a pure pull model
where you come from drm_file -> gpu ctx -> amdgpu_ctx_entity and have
borrowed references by holding enough locks.

The push model, where the scheduler job actively pushes the stats all the
way to the userspace/uapi ctx object is imo impossible to get right.
i915-gem tried a few times, and solutions start with sprinkling rcu
everywhere and only get worse. Imo absolute no-go.

If I understand this right then yes I think extracting the
amgpu_ctx_entity split into a bit of helper. Might want to extract the
context xarray and lookup too and put it into drm_file, including the
amdgpu_ctx_mgr->lock and iterating over stats for fdinfo maybe too to make
this really worth it.

Of all the failed attempts I've seen for exposing ctx stats this (amdgpu
push model) is imo the only solid one.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 16:42:11

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, 6 Apr 2023 at 17:59, Daniel Vetter <[email protected]> wrote:
>
> On Thu, Apr 06, 2023 at 05:33:18PM +0200, Christian König wrote:
> > Am 06.04.23 um 17:24 schrieb Lucas Stach:
> > > Am Donnerstag, dem 06.04.2023 um 16:21 +0200 schrieb Christian König:
> > > > Am 06.04.23 um 12:45 schrieb Lucas Stach:
> > > > > Am Donnerstag, dem 06.04.2023 um 10:27 +0200 schrieb Daniel Vetter:
> > > > > > On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
> > > > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > > > Hi Danilo,
> > > > > > > > > > >
> > > > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > > > >
> > > > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > > > >
> > > > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > > > a UAF.
> > > > > > > > > > > >
> > > > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > > > >
> > > > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > > > >
> > > > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > > > a fix directly.
> > > > > > > > > > > >
> > > > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > > > >
> > > > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > > > Danilo,
> > > > > > > > > >
> > > > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > > > patch for this option and drop this one:
> > > > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > > > example rarely dropping the last reference from interrupt context or
> > > > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > > > that is a good recipe for problems as well.
> > > > > > >
> > > > > Fully agreed.
> > > > >
> > > > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > > > but I'll comment on that thread as well for the record.
> > > > > > > Reference counting the entities was suggested before. The intentionally
> > > > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > > > of stuff you need to keep around.
> > > > > > >
> > > > > > > For example for command submission you also need the VM and when you
> > > > > > > keep the VM alive you also need to keep the file private alive....
> > > > > > Yeah refcounting looks often like the easy way out to avoid
> > > > > > use-after-free issue, until you realize you've just made lifetimes
> > > > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > > > because vk_device_lost means userspace allocates new stuff for
> > > > > > everything.
> > > > > >
> > > > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > > > generally much more solid. But also much harder to design correctly
> > > > > > :-/
> > > > > >
> > > > > The use we are discussing here is to keep the entity alive as long as
> > > > > jobs from that entity are still active on the HW. While there are no
> > > > > hard bounds on when a job will get inactive, at least it's not
> > > > > unbounded. On a crash/fault the job will be removed from the hardware
> > > > > pretty soon.
> > > > >
> > > > > Well behaved jobs after a application shutdown might take a little
> > > > > longer, but I don't really see the new problem with keeping the entity
> > > > > alive? As long as a job is active on the hardware, we can't throw out
> > > > > the VM or BOs, no difference whether the entity is kept alive or not.
> > > > Exactly that's the problem. VM & BOs are dropped as soon as the process
> > > > is destroyed, we *don't* wait for the hw to finish before doing so.
> > > >
> > > > Just the backing store managed by all the house keeping objects isn't
> > > > freed until the hw is idle preventing a crash or accessing freed memory.
> > > >
> > > > This behavior is rather important for the OOM killer since we need to be
> > > > able to tear down the process as fast as possible in that case.
> > > >
> > > Are you talking about the dropping of pending jobs in
> > > drm_sched_entity_kill? I'm certainly not trying to change that in any
> > > way. Those aren't put onto the hardware yet, so we can always safely
> > > drop them and do so as fast as possible.
> > >
> > > What I'm concerned about are the jobs that are already scheduled on the
> > > HW. At least with Vivante hardware there is no race free way to get rid
> > > of jobs once they are put on the ring. So whatever the scheduler or DRM
> > > core is doing, we have to hold on to the BOs and GPU memory management
> > > structures to keep the hardware from operating on freed memory.
> > >
> > > That's already a lot of memory, so I don't really see the issue with
> > > keeping the entity around in a quiescent state until all the currently
> > > queued jobs have left the HW.
> > >
> > > > Changing that is possible, but that's quite a huge change I'm not really
> > > > willing to do just for tracking the time spend.
> > > >
> > > Yea, it's a big change and whether it's a good idea really depends on
> > > what we a gaining from it. You seem to see quite low value in "just
> > > tracking the time spent" and that might be true, but it also forces all
> > > drivers that want to implement fdinfo to roll their own time tracking.
> > > I would rather see more of this moved to the scheduler and thus shared
> > > between drivers.
> >
> > That's generally a good idea, but if that means that we need to restructure
> > the whole entity handling then I would object. That's simply not worth it
> > when we can implement it differently.
> >
> > What we could do is to keep the submitted fences around in the entity.
> > Similar to the tracking amdgpu does, see struct amdgpu_ctx_entity.
> >
> > This way the entity doesn't needs to stay around after it delivered the job
> > to the hw.
>
> I've done a _very_ cursory look, but from that the design seems to be that
> we only keep something very small around to avoid oopsing (just
> drm_sched_entity) and the overall gpu ctx goes away synchronously (more or
> less) when userspace destroys it.
>
> And then the actual fdinfo reporting or any reporting is a pure pull model
> where you come from drm_file -> gpu ctx -> amdgpu_ctx_entity and have
> borrowed references by holding enough locks.
>
> The push model, where the scheduler job actively pushes the stats all the
> way to the userspace/uapi ctx object is imo impossible to get right.
> i915-gem tried a few times, and solutions start with sprinkling rcu
> everywhere and only get worse. Imo absolute no-go.
>
> If I understand this right then yes I think extracting the
> amgpu_ctx_entity split into a bit of helper. Might want to extract the
> context xarray and lookup too and put it into drm_file, including the
> amdgpu_ctx_mgr->lock and iterating over stats for fdinfo maybe too to make
> this really worth it.
>
> Of all the failed attempts I've seen for exposing ctx stats this (amdgpu
> push model) is imo the only solid one.

amdgpu _pull_ model. Yay for me introducing some concepts and then
mixing them up in the same email :-/

>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 17:19:02

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 07/04/2023 00.30, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 11:43:19PM +0900, Asahi Lina wrote:
>> On 06/04/2023 22.37, Daniel Vetter wrote:
>>> On Thu, Apr 06, 2023 at 09:21:47PM +0900, Asahi Lina wrote:
>>>> On 06/04/2023 19.09, Daniel Vetter wrote:
>>>>> On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
>>>>>> On 06/04/2023 17.27, Daniel Vetter wrote:
>>>>>>> On Thu, 6 Apr 2023 at 10:22, Christian König <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>>>>>>>>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>>>>>>>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>>>>>>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>>>>>>>>> Hi Danilo,
>>>>>>>>>>>>
>>>>>>>>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>>>>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>>>>>>>>> writing it to the entity through which the job was deployed to the
>>>>>>>>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>>>>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Doing this can result in a race condition where the entity is already
>>>>>>>>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>>>>>>>>> once the job is fetched from the pending_list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>>>>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>>>>>>>>> over to the scheduler by this entity might still reside in the
>>>>>>>>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>>>>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>>>>>>>>> a UAF.
>>>>>>>>>>>>>
>>>>>>>>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>>>>>>>>
>>>>>>>>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>>>>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>>>>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>>>>>>>>> allocated for a subsequent client applications job structure again.
>>>>>>>>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>>>>>>>>> reproduce the same corruption over and over again by just using
>>>>>>>>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>>>>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>>>>>>>>> a fix directly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spontaneously, I see three options to fix it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>>>>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>>>>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>>>>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>>>>>>>>
>>>>>>>>>>>> My vote is on this or something in similar vain for the long term. I
>>>>>>>>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>>>>>>>>> more fairness than the current one sometime in the future, which
>>>>>>>>>>>> requires execution time tracking on the entities.
>>>>>>>>>>> Danilo,
>>>>>>>>>>>
>>>>>>>>>>> Using kref is preferable, i.e. option 1 above.
>>>>>>>>>> I think the only real motivation for doing that would be for generically
>>>>>>>>>> tracking job statistics within the entity a job was deployed through. If
>>>>>>>>>> we all agree on tracking job statistics this way I am happy to prepare a
>>>>>>>>>> patch for this option and drop this one:
>>>>>>>>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>>>>>>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>>>>>>>>> The reason kref is attractive is because one doesn't need to worry about
>>>>>>>>> it--when the last user drops the kref, the release is called to do
>>>>>>>>> housekeeping. If this never happens, we know that we have a bug to debug.
>>>>>>>>
>>>>>>>> Yeah, reference counting unfortunately have some traps as well. For
>>>>>>>> example rarely dropping the last reference from interrupt context or
>>>>>>>> with some unexpected locks help when the cleanup function doesn't expect
>>>>>>>> that is a good recipe for problems as well.
>>>>>>>>
>>>>>>>>> Regarding the patch above--I did look around the code, and it seems safe,
>>>>>>>>> as per your analysis, I didn't see any reference to entity after job submission,
>>>>>>>>> but I'll comment on that thread as well for the record.
>>>>>>>>
>>>>>>>> Reference counting the entities was suggested before. The intentionally
>>>>>>>> avoided that so far because the entity might be the tip of the iceberg
>>>>>>>> of stuff you need to keep around.
>>>>>>>>
>>>>>>>> For example for command submission you also need the VM and when you
>>>>>>>> keep the VM alive you also need to keep the file private alive....
>>>>>>>
>>>>>>> Yeah refcounting looks often like the easy way out to avoid
>>>>>>> use-after-free issue, until you realize you've just made lifetimes
>>>>>>> unbounded and have some enourmous leaks: entity keeps vm alive, vm
>>>>>>> keeps all the bo alives, somehow every crash wastes more memory
>>>>>>> because vk_device_lost means userspace allocates new stuff for
>>>>>>> everything.
>>>>>>
>>>>>> Refcounting everywhere has been working well for us, so well that so far all
>>>>>> the oopses we've hit have been... drm_sched bugs like this one, not anything
>>>>>> in the driver. But at least in Rust you have the advantage that you can't
>>>>>> just forget a decref in a rarely-hit error path (or worse, forget an incref
>>>>>> somewhere important)... ^^
>>>>>>
>>>>>>> If possible a lifetime design where lifetimes have hard bounds and you
>>>>>>> just borrow a reference under a lock (or some other ownership rule) is
>>>>>>> generally much more solid. But also much harder to design correctly
>>>>>>> :-/
>>>>>>>
>>>>>>>> Additional to that we have some ugly inter dependencies between tearing
>>>>>>>> down an application (potential with a KILL signal from the OOM killer)
>>>>>>>> and backward compatibility for some applications which render something
>>>>>>>> and quit before the rendering is completed in the hardware.
>>>>>>>
>>>>>>> Yeah I think that part would also be good to sort out once&for all in
>>>>>>> drm/sched, because i915 has/had the same struggle.
>>>>>>> -Daniel
>>>>>>>
>>>>>>
>>>>>> Is this really a thing? I think that's never going to work well for explicit
>>>>>> sync, since the kernel doesn't even know what BOs it has to keep alive for a
>>>>>> job... I guess it could keep the entire file and all of its objects/VMs/etc
>>>>>> alive until all of its submissions complete but... ewww.
>>>>>>
>>>>>> Our Mesa implementation synchronously waits for all jobs on context destroy
>>>>>> for this reason, but if you just kill the app, yeah, you get faults as
>>>>>> running GPU jobs have BOs yanked out from under them. Kill loops make for a
>>>>>> good way of testing fault handling...
>>>>>
>>>>> You wind down the entire thing on file close? Like
>>>>> - stop all context
>>>>> - tear down all context
>>>>> - tear down all vm
>>>>> - tear down all obj
>>>>>
>>>>> Just winding things down in a random order and then letting gpu fault
>>>>> handling sort out the mess doesn't strike me as particularly clean design
>>>>> ...
>>>> The idea is that object drop order generally doesn't matter since things
>>>> that care about other things should own them or hold references to them
>>>> anyway, so the dependency graph of all the resources is encoded directly in
>>>> the type hierarchy instead of having to open-code a "cleanup procedure"...
>>>> which then invariably leads to corner cases when you have to do the same
>>>> thing, or part of it, for error handling.
>>>>
>>>> This has been working *very* well! It solves the issue of error handling
>>>> since error handling just unwinds whatever was done to that point naturally
>>>> in Rust (? operator), so there's practically no open-coded error handling
>>>> code anywhere. The first time we ran into OOMs (Xonotic with no Mesa texture
>>>> compression support yet, on 8GB machines on max settings...) the whole thing
>>>> just worked. OOM killer went rampant and shmem doesn't account stuff to
>>>> processes properly of course, but all the error paths, allocation errors,
>>>> etc... all of that just worked, first try, across dozens of error paths that
>>>> had never been tested before, not a single oops or deadlock or leak or
>>>> anything in sight. Similarly, yesterday I did manage to run into drm_sched
>>>> failing to create kthreads (the scaling issue Matthew's series fixes)... and
>>>> still, that was fine. That happens on queue creation so it just bubbled up
>>>> to Mesa as a failed ioctl and things kept moving along nicely otherwise. I
>>>> even have nice ergonomic XArray semantics so that you can reserve a new
>>>> slot, allocate some object, then populate it, and if you don't (because you
>>>> errored out in between) it automatically gets freed again without explicit
>>>> cleanup code.
>>>>
>>>> And it also means that I can encode *firmware* resource dependencies in the
>>>> type system (with Rust lifetimes attached to *GPU* pointers even - it's a
>>>> bit dodgy how it's done but it works well in practice). Since it is
>>>> absolutely critical that the firmware objects respect their lifetimes or
>>>> else the whole thing crashes irrecoverably, this is the only way I feel it's
>>>> been even practical to write this driver and not be a firmware crash mess.
>>>> Of course we still get some crashes due to flaws in how I understand the
>>>> firmware, but it's always things I don't know, not things I accidentally
>>>> messed up in some corner case code path we don't normally hit, since I just
>>>> don't have to think about that as long as the hierarchy is right.
>>>>
>>>> I actually don't know exactly what precise order things get dropped in for
>>>> this reason! I could find out, and it's predictable in Rust, what I mean is
>>>> that thinking about a total order like that is not necessary for correctness
>>>> as long as I got the ownership right. Off the top of my head though, it goes
>>>> very roughly like this:
>>>>
>>>> - On File close, all the GEM objects get closed (DRM core does this)
>>>> - This triggers explicit drops of all mappings in those GEM objects owned by
>>>> that File (identified by unique ID, this is the one annoying circular
>>>> reference thing I mentioned in the other thread...). At this point the GPU
>>>> probably faults but we don't care. *
>>>> - The File itself gets dropped, which drops the XArrays for queues and
>>>> (UAPI) VMs
>>>> - UAPI VMs getting dropped doesn't do much other than unmap a single dummy
>>>> object. The underlying MMU VM is refcounted and jobs hold references. This
>>>> also drops the userspace VM object allocator used for kernel-managed
>>>> allocations, but that too is internally refcounted and won't go away until
>>>> all of its allocations do.
>>>> - Queues get dropped, which mostly drops a bunch of references to things
>>>> that no longer matter, along with the scheduler and scheduler entity.
>>>> - The entity already has a reference to the scheduler in the abstraction (to
>>>> meet the soundness requirement), so the entity necessarily goes first. That
>>>> kills all not yet scheduled jobs, freeing any resources they might use.
>>>> - Then the scheduler gets torn down, and with my other patch that logically
>>>> kills all in-flight jobs, detaching their hardware fences and dropping the
>>>> job objects. This... still doesn't do much other than drop some references
>>>> that we don't care about.
>>>> - At this point, if any jobs are in flight, their firmware objects and all
>>>> of the type hierarchy that goes with them is still alive, as well as the
>>>> firmware queues they're in and the Rust objects representing them, the VMs
>>>> they use, the Events they have been allocated...
>>>> - Jobs complete (successfully or fault), then when complete get popped off
>>>> of the Rust-side queue objects that represent the firmware queues.
>>>> - When any given FW queue is empty, it relinquishes its assigned firmware
>>>> event ID. That causes the event system to drop its owner reference to it,
>>>> which means the queue itself gets dropped (since the UAPI Queue that also
>>>> held a reference is gone). That then also drops a reference to what I call
>>>> the GpuContext.
>>>> - To avoid deadlocks, completed job objects are freed in another thread
>>>> (ugly hack right now, should be done better in the future). Eventually as
>>>> that happens, any resources they reference are dropped, including some
>>>> shared ones that are logically tied to the scheduler/queues, references to
>>>> the MMU VM address space, references to the VM slot that address space is
>>>> assigned to, objects allocated out of user VM space, everything. Refcounts
>>>> everywhere for anything shared, direct ownership of child structures for
>>>> anything that isn't (work buffers, firmware command lists, etc.). I once
>>>> tried to make a slide of the references and pointers involved in just the
>>>> vertex half of a single GPU job and... even just that was quite interesting.
>>>> - When the last job completes, that drops the last reference to the VM slot,
>>>> which means the backing VM is logically detached from the GPU MMU (this is
>>>> lazy though, it's still there in practice).
>>>> - When the last firmware queue is dropped for a logical queue/sched/etc
>>>> (which means no more jobs are running at the GPU for that context), that
>>>> drops the last reference to the GpuContext. That also gets shoved into
>>>> another thread context for cleanup to avoid deadlocks with fault recovery.
>>>> - When that is finally cleaned up, a firmware command is sent to invalidate
>>>> the GpuContext. I'm still figuring out what that does and what the lifetime
>>>> rules are here (this is the only explicit invalidation command that exists),
>>>> but as of yesterday I know that at the very least we need to keep hold of
>>>> any Tiled Vertex Buffer associated with it until after inval, so that now
>>>> has a reference to it that gets dropped after the firmware acknowledges the
>>>> invalidation (unless it's a non-render-capable Queue, then no TVB
>>>> necessary).
>>>> - When the Buffer gets dropped, that frees both its backing memory and
>>>> (virtual) page list structures, which are in user VM space, as well as some
>>>> kernel firmware objects.
>>>> - If things have happened in the order I listed here, those will be the last
>>>> allocations in the two user VM space heap object allocators, so those now
>>>> get dropped, which drops the mappings of their backing GEM objects,
>>>> unmapping them from the MMU VM page tables.
>>>> - Those mappings will now be the last references to the actual MMU VM
>>>> object, so that it gets destroyed (the lazy detach comes into effect here,
>>>> PT base address is removed from the VM context base table, full ASID
>>>> invalidate, etc.), which with it drops the IoPgTable that backs it, which
>>>> frees the page tables.
>>>> - Then finally the GEM objects backing the userspace allocators get dropped
>>>> as well, which will be the last reference to them, so those get freed.
>>>
>>> Thanks for the write up!
>>>
>>> Maybe some more fundamental thoughts on this entire endeavour:
>>>
>>> I think one thing that's causing a lot of friction that in C drivers at
>>> least some of these things are done with implied/borrowed references. If
>>> you want an absolute shocking example, the gpu crash dump sampling for a
>>> driver that does dynamic memory management can be pretty epic:
>>>
>>> - the timeout handler is guaranteed (well if the driver didn't screw up
>>> things, which mostly boils down to call drm_sched_stop() before it
>>> analyzes anything at all wrt gpu hw state) to hold up the drm_job
>>> completion fence
>>>
>>> - this fence is (at submit ioctl time) installed into all the dma_resv
>>> fence containers for the vm (single shared dma_resv for all vm-private obj for
>>> efficiency, all shareable gem_bo have their own dma_resv)
>>>
>>> - the memory manager guarantees that it will not free (or even move in the
>>> case of ttm) the buffer while a fence is unsingalled. ttm does this by
>>> delaying the free as needed, i915-gem and some of the others did this by
>>> trying to hold onto extra references. the latter causes way too much
>>> problems with reference dropping in my experience looking at a decade of
>>> drivers getting details wrong here.
>>>
>>> - which means the crash dump handler can have a plain list of gem_bo to
>>> record, without any references, because they wont go away before it
>>> finishes.
>>>
>>> - more fun even, you could directly sample the memory locations at ioctl
>>> submit time, and even when ttm moves the buffers around in a pipelined
>>> fashion: your timeout handler wont run before all the jobs to move the
>>> buffer into the right location have completed, and it will not unblock
>>> any subsequent move or eviction that's already queued up. Which means
>>> even the raw memory address will point at the right stuff.
>>>
>>> This is just one example. drm and the kernel is full of these borrowed
>>> reference tricks with often cross through the entire stack, so rust has to
>>> be able to cope. The sales pitch of rust, and really the reason it's the
>>> first non-C language in linux, is that with the borrow semantics, it can
>>> validate this stuff at compile time, and allow kernel drivers to continue
>>> playing these tricks to outright avoid any and all refcounting needs. If
>>> rust doesn't deliver and we need to have all the refcounting again to make
>>> rust drivers safe I think that'll cast serious doubts on the viability of
>>> rust-in-linux.
>>
>> Right, so the thing is Rust can encode *some* borrowed reference semantics
>> at compile time, namely what Rust lifetimes can represent. But you can't
>> encode arbitrarily complex "I can guarantee this is alive right now because
>> reasons A,B,C,D" semantics in the type system. For arbitrarily complex rules
>> like that, you need to either refcount, or use unsafe and raw pointers. For
>> example, in my driver I use raw pointers in the event slot acquisition to
>> point to the actual u32s that hold the event values, because I can guarantee
>> through the semantics of that module that those will always be valid
>> pointers to the live array (and never alias too), even though I can't encode
>> that in the type system. Throwing around refcounted pointers to the whole
>> object there would be silly, so instead I just used unsafe and left a SAFETY
>> comment explaining why it's safe.
>>
>> In the end, unsafe isn't any worse than C (it's strictly better, you still
>> get the borrow checker and type safety and everything else), so I don't
>> think Rust not being able to solve all of these problems is really a good
>> argument against Rust. The idea, and where Rust really shines, is that even
>> when you have to do this you are *containing* the unsafe code within places
>> where it can be audited and explained. And when the majority of the code is
>> safe, and when you strive to make the interfaces between modules safe, and
>> when you limit the ability of cross-module interactions to break safety (Not
>> eliminate! In abstractions yes, and in userspace Rust this is expected, but
>> within a driver it gets more interesting and I can spend a long time talking
>> about how safety boundaries in drivers are complicated and a bit porous...)
>> you end up reducing the chances of safety bugs by orders of magnitude
>> anyway.
>>
>> And it worked, right? I have ~128 unsafe{} blocks in the driver and still no
>> oopses in production ^^
>>
>> But (big but!) the idea with safe Rust abstractions (the stuff in rust/) is
>> that they absolutely do need to be safe (unless marked unsafe, but that
>> sucks...), because then implicitly you are documenting the safety/lifetime
>> requirements in the type system and that's hugely beneficial for common
>> code. It means users of those APIs don't have to think about the internals
>> and safety invariants and all that. Drivers still have to worry about
>> internal safety, but they don't have to worry about messing up the interface
>> with common code. That's also very beneficial for allowing refactoring of
>> that common code without worrying that you'll break users.
>>
>> So technically we could just give up and mark the drm_sched abstraction
>> unsafe and actually document all of these safety requirements (which is the
>> other side of the coin - if you want to use unsafe, you get to write the
>> docs, lawyer style! That's the rule! No getting by with vague docs with
>> dozens of unexplained corner cases!). But, honestly, I think that would be
>> kind of sad... drm_sched really should have a self-contained, safe API, and
>> I just don't see a good reason why it can't other than it not having been
>> designed with this in mind initially. It was ported out of a driver, and not
>> that long ago, right? That explains a lot, because it means it is probably
>> leaking all of these assumptions from that driver's internal design... and
>> it probably doesn't help that the people maintaining it all seem to be from
>> that particular company either. It's easy to have tunnel vision if you
>> mostly work on one use case...
>
> Nah I don't want to give up. What I expect to happen is that rust
> abstractions will need to be a lot more opionated about how to do certain
> things. And then the abstractions together with the C code provide the
> guarantee that the rust driver is safe.
>
>>> Now there's definitely going to be hilarious bugs uncovered on the C side,
>>> and semantics that need to be clarified, but I think enabling scary tricks
>>> like the above one is what'll fundamentally make or break rust in linux.
>>
>> For the specific case of crash dumps and stop-the-world stuff like that...
>> yeah, those are some of the worst things to fit in otherwise nice Rust
>> designs because they violate layering left and right. I don't have a good
>> answer for how exactly I'd do this in my driver yet. I'm sure it *can* be
>> done, and there's always unsafe if needed, since at that point you're
>> dealing in driver code and that's a lot less contentious than an abstraction
>> being unsafe itself...
>>
>> My initial plan for FW crash dumps is going to be fairly easy because I only
>> really care about dumping the firmware object arena and maybe actual
>> firmware .data, and that's all GPU-global anyway, and it won't be hard to
>> poke my way into the global allocators to take a snapshot of their backing
>> GEM objects. There's a global alloc lock involved there which is all I need
>> for basic consistency and getting the locking right for allocs as far as
>> deadlock risk is something I need to care about for other reasons anyway. It
>> won't guarantee that nobody else is *writing* to any of that memory while I
>> take a snapshot, but I don't really care about that since it would be rare
>> and should be limited to newly-allocated memory the firmware doesn't yet
>> care about anyway.
>
> So this is were crash dump capturing gets fun. If that lock covers any
> memory allocations, you can't take it in the tdr path.
>
> The simplest way out is to make context crashes permanent and bail out the
> entire crash capturing to a worker (which needs gem bo references) that
> doesn't block anything (so not even guilty/victim reporting to userspace)
> and can take locks and capture things at leasure.

Yeah, I'm pretty sure we can just do that by doing things in the
opposite order. Our FW crashes are permanent GPU crashes, so we can just
signal first *then* take the snapshot right in the event RX thread.
After all, the firmware just crashed, so that thread is never going to
get another message again and is as good as any other. Signaling will
cause a bunch of stuff to be freed, but as long as nothing allocs GPU
memory on top (I can add a guard to fail allocs after a crash), it won't
make a difference for non-debug dumps (in debug mode it will mark all
freed objects as DEAD which will show up in the metadata blocks, but I
can probably just add a special case to make it not do that after a
crash, which is a good idea for hypervisor-based debugging too anyway).

> Only works when userspace obeys the promise of not overwriting the memory
> meanwhile, which is what you get for vk_device_lost.

Userspace won't have access to firmware memory anyway, which is what I
really care about. Honestly, I'm not sure we have much of a use case for
full-GPU postmortem debugging like that. If the firmware crashed I 99%
only care about the firmware heap + firmware .data (which I can snoop
out of its RAM carveout), and maybe a trace log of recent driver events
*and* firmware ktrace events (yes, the firmware has its own tracing!).
If the GPU faulted the driver shouldn't even have to care much more than
it already does. We already pass user fault info down into Mesa, and
already print the likely culprit BO right in Mesa together with fault
unit, type, etc. (it's cute!). I think I'd rather just add a strict
fault mode that fully kills the context on fault (and ideally does
something to prevent the firmware from even trying to execute any
already queued commands after the crash one, TBD how to do this but I
bet it's possible during the recovery cycle) and then Mesa finds out
about the fault as soon as possible and we can implement GPU-side crash
dumps right in userspace.

We could even add stuff to the kernel to gather rich execution state
info and pass it back to Mesa too, it's just going to take a ton of
reverse engineering of the GPU registers to understand them since
normally only firmware deals with that... but it's totally possible and
I already have a huge list of "interesting" register addresses because
the macOS driver has a big debug print function that gets called on GPU
faults... which has all the printk()s compiled out in release builds,
but it still reads all the registers anyway, so I see all the register
reads in my hypervisor log! ^^

>
>> In the end, nothing stops you from having piles of raw pointer backdoors,
>> having a rule that "this is only used for the crash handler", and then
>> having a couple pages of SAFETY comments in that bit of the code with a big
>> "here be dragons" disclaimer. It won't make that particular chunk of code
>> any less likely to have bugs than the equivalent C code, but at least the
>> rest of the driver isn't affected, and hopefully you can at least avoid
>> unsafe in some subsets of the troublesome part too ^^
>>
>> There's also a flipside to this though: Who cares if you take extra
>> redundant GEM BO references in the crash handler due to API safety? After
>> all, it's not a perf-sensitive codepath. Of course, if safety causes you to
>> add refcounting at all where none would otherwise be needed, or to have to
>> do more of it in hot paths, or to introduce extra locking that could cause
>> deadlocks that's a different story. But I think the thought process is very
>> different between Rust and C here, when writing drivers. In C, it takes
>> extra work to be safe, so chances are most people won't write defensive code
>> like that even where there is no performance reason not to, since it's
>> outright easier to read code the less safe it is. Rust isn't like that! Most
>> safety is encoded in types (automatic reference counting is a big one here),
>> so it doesn't take any extra effort to be safe, while it usually does to do
>> it unsafely (unsafe {} blocks if nothing else). So then you write safe code
>> by default... which yes, might do some extra refcounting and locking and
>> stuff, but you can always optimize that later, profiler in hand. Once you
>> know you need unsafe for real perf reasons, by all means, and I expect to
>> end up doing a round of profiling in the future and have to poke some holes
>> in things too. But until then... safe is just easier and, well, safer!
>
> The problem with extra refcounting is dropping them. There's endless
> amounts of fun where dropping a kref at the wrong time goes boom (wrong
> context, holding wrong locks). Or then you delay it and userspace gets
> annoyed (task_work for struct file is fun), so just unconditionally
> stuffing everything you free into a worker isn't always the right thing
> either.
>
> And rust makes this extremely easy, all the Drop stuff isn't visible
> anywhere in the code, and you'll never screw up the error paths. So my
> take is a bit that unless you've excluded a borrow based design as
> impossible already, sprinkling Arc<T> all over isn't really solid either.

Yes, this is definitely a problem... This is where I really hope we get
the execution context stuff some day, so this can also be checked at
compile time. Fingers crossed!

>
>>> In the end it'll be lots of detail work, and I really hope it all works
>>> out.
>>>
>>>> I probably got more than one thing wrong there, and there's layers of
>>>> complexity I glossed over, but that's the rough idea ^^
>>>>
>>>> * If we need to fix this then we're going to need some kind of signal from
>>>> the DRM core that this is happening and it's not normal user-triggered GEM
>>>> closing, and it's going to be interesting... it also means we need some kind
>>>> of mechanism to transfer responsibility over those mappings to all in-flight
>>>> jobs themselves, because normally userspace is strictly responsible for all
>>>> mappings in an explicit sync VM bind style world, and now we're adding a
>>>> special case where we freeze the state of the VM until all in-flight jobs
>>>> are done when the File goes away instead of eagerly freeing things. That's a
>>>> very weird departure from how the whole thing normally works, so if we
>>>> really want that I'm going to have to think of how to do it reasonably. It
>>>> might be easier once we implement the full VM map range tracking which will
>>>> likely flip the VM<->GEM object relationship around a bit.
>>>
>>> See my other reply for you how can untangle this potentially on the vm_id
>>> subthread. I think in general rust Drop is awesome, but there's cases
>>> where we'll have to be more explicit at unwinding things. At least in the
>>> linux kernel the general consensus is that too much refcounting is a
>>> maintenance disaster because you end up dropping the final reference in
>>> all kinds of really bad places (interrupt handlers, while holding the
>>> wrong locks, holding them too long).
>>
>> Yeah, that's one I've run into, it's why I have a few places with subtle
>> code to avoid that and more things will need cleaning up too...
>>
>> This is something that I expect will become Rust compile-time magic in the
>> future though! There's already been talk about being able to prove that
>> things are only ever called from the right execution context. Then if you
>> have a Drop of a reference whose underlying type is not safe to Drop from
>> IRQ context in IRQ context, the compiler complains. It's not clear exactly
>> what this will look like yet, but it's an active topic of discussion. And
>> that's another nice thing about Rust: we can actually help drive the
>> evolution of the language, much faster than with C!
>>
>>> Somewhat related I also expect that
>>> rust drivers will need to have quite a few manual drop calls to
>>> artificially limit lifetime, because sometimes when you drop things the
>>> wrong way round (e.g. drop the refcount while still having the mutex guard
>>> live) results in deadlocks.
>>
>> Yup! ^^
>>
>> drivers/gpu/drm/asahi$ grep -r core::mem::drop | wc -l
>> 17
>>
>> Rust is magic but it won't magically solve all of our problems!
>>
>> (I don't think all of those 17 are necessary, some are just places where I
>> wanted to be explicit/safer or bound critical sections more, but a few are
>> definitely there to avoid deadlocks.)
>>
>> Simple Arc<Mutex<T>> usage isn't a problem though, you can't drop the
>> refcount without dropping the guard first there (lifetimes prevent it). The
>> tricky bit is when you have to call into *another* agent that owns a
>> reference to the same object, after having done something while holding the
>> mutex. Then you really need to drop the guard first.
>
> Yeah it's the cross object/module/crate fun for refcount dropping.
>
> The trouble is that refcount bugs are substantially worse to validate at
> runtime than locking. Locking isn't optional, so generally all you need to
> do is enable lockdep and run through all the codepaths and you're good.
> Sometimes that's a bit too tough and you can insert fake lockdep contexts
> to tie things together like fs_reclaim or dma_fence_signalling, but
> by&large it's not that hard to validate at runtime to a reasonable
> confidence.
>
> Reference dropping otoh isn't, because usually userspace holds onto a
> reference and drops the final reference through a syscall in a very well
> defined place. Now you could just add a fake lockdep context to pretend
> that any kref_put could be the final one, but generally this results in a
> lot of false positives (because of borrowed or implied references
> protecting you from the final reference drop). Which means unless you've
> gone through the trouble of proving that your final kref_put/rust Arc
> drops are never in a bad place (which means pretty much uncovering the
> entire Drop hierarchy mentally in every place) you have some potentially
> really nasty problems at hand.

Ah... honestly, I would actually want the strict version of that where
any ref drop could be the last one. It's the only model that will work
with future compile time checks anyway (unless you add explicit
annotations to ignore some drops, maybe something like a nonfinal_drop()
method that consumes a reference and instead asserts that it *isn't* the
final one for these cases). It might be annoying to annotate those and
get rid of the false positives, but annoying compilers and tooling is
how we get safety in Rust after all ^^

This isn't actually that bad, since for objects with this problem I *do*
want to actually expose the Drop hierarchy. Usually when this happens I
*think* I know where this stuff gets dropped, so it'd be nice if the
compiler told me if I'm wrong. I'm actually considering implementing the
undroppable trick in some places in the driver to bring these out,
especially once I really start worrying about not screwing up fence
signaling. It'd be trivial to have an ExplicitlyDropped<T> type (similar
to ManuallyDrop<T>) that just refuses to be dropped automatically at
compile/link time, but lets you do an .explicit_drop() instead.

> And unlike rust in C all the _put() calls are at least explicitly staring
> at you and reminding you that you're piling up a mountain that might come
> crashing down in an avalanche :-) Ofc you're also all but guaranteed to
> have screwed up an error path, but that's at least a runtime validation
> solveable problem nowadays with kasan + code coverage checks (not that
> we're any good at it tbh, syzkaller has clearly opinions to the contrary
> here).

^^

Hey, I haven't even started talking about the other safety things Rust
can help with that I don't use yet! Stuff like asserting that struct
types are safe to have in UAPIs and zero-initialize, and that we never
leak stack contents and things like that... No matter where all this
leads us, I think it's going to be good for the kernel ^^

>
>>> There's going to be lots of fun here :-)
>>
>> It's already been fun and I don't expect it to get any less fun ^^
>
> Oh that's for sure.
>
>>
>> ~~ Lina
>>
>

~~ Lina

2023-04-06 17:44:07

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Fri, Apr 07, 2023 at 02:11:32AM +0900, Asahi Lina wrote:
> On 07/04/2023 00.30, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 11:43:19PM +0900, Asahi Lina wrote:
> > > On 06/04/2023 22.37, Daniel Vetter wrote:
> > > > On Thu, Apr 06, 2023 at 09:21:47PM +0900, Asahi Lina wrote:
> > > > > On 06/04/2023 19.09, Daniel Vetter wrote:
> > > > > > On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
> > > > > > > On 06/04/2023 17.27, Daniel Vetter wrote:
> > > > > > > > On Thu, 6 Apr 2023 at 10:22, Christian K?nig <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > > > > > > > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > > > > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > > > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > > > > > > > Hi Danilo,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > > > > > > > tries to track the accumulated time that a job was active on the GPU
> > > > > > > > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job()
> > > > > > > > > > > > > > which fetches a job from the schedulers pending_list.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > > > > > > > once the job is fetched from the pending_list.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > After drm_sched_entity_destroy() being called it should be safe to free
> > > > > > > > > > > > > > the structure that embeds the entity. However, a job originally handed
> > > > > > > > > > > > > > over to the scheduler by this entity might still reside in the
> > > > > > > > > > > > > > schedulers pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > > > > > > > already being called and the entity being freed. Hence, we can run into
> > > > > > > > > > > > > > a UAF.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > > > > > > > from the schedulers pending_list after the entity was freed due to the
> > > > > > > > > > > > > > client application exiting. Meanwhile this freed up memory was already
> > > > > > > > > > > > > > allocated for a subsequent client applications job structure again.
> > > > > > > > > > > > > > Hence, the new jobs memory got corrupted. Luckily, I was able to
> > > > > > > > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > > > > > > > a fix directly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > > > > > > > deployed through this entity are fetched from the schedulers pending list.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > My vote is on this or something in similar vain for the long term. I
> > > > > > > > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > > > > > > > requires execution time tracking on the entities.
> > > > > > > > > > > > Danilo,
> > > > > > > > > > > >
> > > > > > > > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > > > > > > > I think the only real motivation for doing that would be for generically
> > > > > > > > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > > > > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > > > > > > > patch for this option and drop this one:
> > > > > > > > > > > https://lore.kernel.org/all/[email protected]/T/#u
> > > > > > > > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > > > > > > > The reason kref is attractive is because one doesn't need to worry about
> > > > > > > > > > it--when the last user drops the kref, the release is called to do
> > > > > > > > > > housekeeping. If this never happens, we know that we have a bug to debug.
> > > > > > > > >
> > > > > > > > > Yeah, reference counting unfortunately have some traps as well. For
> > > > > > > > > example rarely dropping the last reference from interrupt context or
> > > > > > > > > with some unexpected locks help when the cleanup function doesn't expect
> > > > > > > > > that is a good recipe for problems as well.
> > > > > > > > >
> > > > > > > > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > > > > > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > > > > > > > but I'll comment on that thread as well for the record.
> > > > > > > > >
> > > > > > > > > Reference counting the entities was suggested before. The intentionally
> > > > > > > > > avoided that so far because the entity might be the tip of the iceberg
> > > > > > > > > of stuff you need to keep around.
> > > > > > > > >
> > > > > > > > > For example for command submission you also need the VM and when you
> > > > > > > > > keep the VM alive you also need to keep the file private alive....
> > > > > > > >
> > > > > > > > Yeah refcounting looks often like the easy way out to avoid
> > > > > > > > use-after-free issue, until you realize you've just made lifetimes
> > > > > > > > unbounded and have some enourmous leaks: entity keeps vm alive, vm
> > > > > > > > keeps all the bo alives, somehow every crash wastes more memory
> > > > > > > > because vk_device_lost means userspace allocates new stuff for
> > > > > > > > everything.
> > > > > > >
> > > > > > > Refcounting everywhere has been working well for us, so well that so far all
> > > > > > > the oopses we've hit have been... drm_sched bugs like this one, not anything
> > > > > > > in the driver. But at least in Rust you have the advantage that you can't
> > > > > > > just forget a decref in a rarely-hit error path (or worse, forget an incref
> > > > > > > somewhere important)... ^^
> > > > > > >
> > > > > > > > If possible a lifetime design where lifetimes have hard bounds and you
> > > > > > > > just borrow a reference under a lock (or some other ownership rule) is
> > > > > > > > generally much more solid. But also much harder to design correctly
> > > > > > > > :-/
> > > > > > > >
> > > > > > > > > Additional to that we have some ugly inter dependencies between tearing
> > > > > > > > > down an application (potential with a KILL signal from the OOM killer)
> > > > > > > > > and backward compatibility for some applications which render something
> > > > > > > > > and quit before the rendering is completed in the hardware.
> > > > > > > >
> > > > > > > > Yeah I think that part would also be good to sort out once&for all in
> > > > > > > > drm/sched, because i915 has/had the same struggle.
> > > > > > > > -Daniel
> > > > > > > >
> > > > > > >
> > > > > > > Is this really a thing? I think that's never going to work well for explicit
> > > > > > > sync, since the kernel doesn't even know what BOs it has to keep alive for a
> > > > > > > job... I guess it could keep the entire file and all of its objects/VMs/etc
> > > > > > > alive until all of its submissions complete but... ewww.
> > > > > > >
> > > > > > > Our Mesa implementation synchronously waits for all jobs on context destroy
> > > > > > > for this reason, but if you just kill the app, yeah, you get faults as
> > > > > > > running GPU jobs have BOs yanked out from under them. Kill loops make for a
> > > > > > > good way of testing fault handling...
> > > > > >
> > > > > > You wind down the entire thing on file close? Like
> > > > > > - stop all context
> > > > > > - tear down all context
> > > > > > - tear down all vm
> > > > > > - tear down all obj
> > > > > >
> > > > > > Just winding things down in a random order and then letting gpu fault
> > > > > > handling sort out the mess doesn't strike me as particularly clean design
> > > > > > ...
> > > > > The idea is that object drop order generally doesn't matter since things
> > > > > that care about other things should own them or hold references to them
> > > > > anyway, so the dependency graph of all the resources is encoded directly in
> > > > > the type hierarchy instead of having to open-code a "cleanup procedure"...
> > > > > which then invariably leads to corner cases when you have to do the same
> > > > > thing, or part of it, for error handling.
> > > > >
> > > > > This has been working *very* well! It solves the issue of error handling
> > > > > since error handling just unwinds whatever was done to that point naturally
> > > > > in Rust (? operator), so there's practically no open-coded error handling
> > > > > code anywhere. The first time we ran into OOMs (Xonotic with no Mesa texture
> > > > > compression support yet, on 8GB machines on max settings...) the whole thing
> > > > > just worked. OOM killer went rampant and shmem doesn't account stuff to
> > > > > processes properly of course, but all the error paths, allocation errors,
> > > > > etc... all of that just worked, first try, across dozens of error paths that
> > > > > had never been tested before, not a single oops or deadlock or leak or
> > > > > anything in sight. Similarly, yesterday I did manage to run into drm_sched
> > > > > failing to create kthreads (the scaling issue Matthew's series fixes)... and
> > > > > still, that was fine. That happens on queue creation so it just bubbled up
> > > > > to Mesa as a failed ioctl and things kept moving along nicely otherwise. I
> > > > > even have nice ergonomic XArray semantics so that you can reserve a new
> > > > > slot, allocate some object, then populate it, and if you don't (because you
> > > > > errored out in between) it automatically gets freed again without explicit
> > > > > cleanup code.
> > > > >
> > > > > And it also means that I can encode *firmware* resource dependencies in the
> > > > > type system (with Rust lifetimes attached to *GPU* pointers even - it's a
> > > > > bit dodgy how it's done but it works well in practice). Since it is
> > > > > absolutely critical that the firmware objects respect their lifetimes or
> > > > > else the whole thing crashes irrecoverably, this is the only way I feel it's
> > > > > been even practical to write this driver and not be a firmware crash mess.
> > > > > Of course we still get some crashes due to flaws in how I understand the
> > > > > firmware, but it's always things I don't know, not things I accidentally
> > > > > messed up in some corner case code path we don't normally hit, since I just
> > > > > don't have to think about that as long as the hierarchy is right.
> > > > >
> > > > > I actually don't know exactly what precise order things get dropped in for
> > > > > this reason! I could find out, and it's predictable in Rust, what I mean is
> > > > > that thinking about a total order like that is not necessary for correctness
> > > > > as long as I got the ownership right. Off the top of my head though, it goes
> > > > > very roughly like this:
> > > > >
> > > > > - On File close, all the GEM objects get closed (DRM core does this)
> > > > > - This triggers explicit drops of all mappings in those GEM objects owned by
> > > > > that File (identified by unique ID, this is the one annoying circular
> > > > > reference thing I mentioned in the other thread...). At this point the GPU
> > > > > probably faults but we don't care. *
> > > > > - The File itself gets dropped, which drops the XArrays for queues and
> > > > > (UAPI) VMs
> > > > > - UAPI VMs getting dropped doesn't do much other than unmap a single dummy
> > > > > object. The underlying MMU VM is refcounted and jobs hold references. This
> > > > > also drops the userspace VM object allocator used for kernel-managed
> > > > > allocations, but that too is internally refcounted and won't go away until
> > > > > all of its allocations do.
> > > > > - Queues get dropped, which mostly drops a bunch of references to things
> > > > > that no longer matter, along with the scheduler and scheduler entity.
> > > > > - The entity already has a reference to the scheduler in the abstraction (to
> > > > > meet the soundness requirement), so the entity necessarily goes first. That
> > > > > kills all not yet scheduled jobs, freeing any resources they might use.
> > > > > - Then the scheduler gets torn down, and with my other patch that logically
> > > > > kills all in-flight jobs, detaching their hardware fences and dropping the
> > > > > job objects. This... still doesn't do much other than drop some references
> > > > > that we don't care about.
> > > > > - At this point, if any jobs are in flight, their firmware objects and all
> > > > > of the type hierarchy that goes with them is still alive, as well as the
> > > > > firmware queues they're in and the Rust objects representing them, the VMs
> > > > > they use, the Events they have been allocated...
> > > > > - Jobs complete (successfully or fault), then when complete get popped off
> > > > > of the Rust-side queue objects that represent the firmware queues.
> > > > > - When any given FW queue is empty, it relinquishes its assigned firmware
> > > > > event ID. That causes the event system to drop its owner reference to it,
> > > > > which means the queue itself gets dropped (since the UAPI Queue that also
> > > > > held a reference is gone). That then also drops a reference to what I call
> > > > > the GpuContext.
> > > > > - To avoid deadlocks, completed job objects are freed in another thread
> > > > > (ugly hack right now, should be done better in the future). Eventually as
> > > > > that happens, any resources they reference are dropped, including some
> > > > > shared ones that are logically tied to the scheduler/queues, references to
> > > > > the MMU VM address space, references to the VM slot that address space is
> > > > > assigned to, objects allocated out of user VM space, everything. Refcounts
> > > > > everywhere for anything shared, direct ownership of child structures for
> > > > > anything that isn't (work buffers, firmware command lists, etc.). I once
> > > > > tried to make a slide of the references and pointers involved in just the
> > > > > vertex half of a single GPU job and... even just that was quite interesting.
> > > > > - When the last job completes, that drops the last reference to the VM slot,
> > > > > which means the backing VM is logically detached from the GPU MMU (this is
> > > > > lazy though, it's still there in practice).
> > > > > - When the last firmware queue is dropped for a logical queue/sched/etc
> > > > > (which means no more jobs are running at the GPU for that context), that
> > > > > drops the last reference to the GpuContext. That also gets shoved into
> > > > > another thread context for cleanup to avoid deadlocks with fault recovery.
> > > > > - When that is finally cleaned up, a firmware command is sent to invalidate
> > > > > the GpuContext. I'm still figuring out what that does and what the lifetime
> > > > > rules are here (this is the only explicit invalidation command that exists),
> > > > > but as of yesterday I know that at the very least we need to keep hold of
> > > > > any Tiled Vertex Buffer associated with it until after inval, so that now
> > > > > has a reference to it that gets dropped after the firmware acknowledges the
> > > > > invalidation (unless it's a non-render-capable Queue, then no TVB
> > > > > necessary).
> > > > > - When the Buffer gets dropped, that frees both its backing memory and
> > > > > (virtual) page list structures, which are in user VM space, as well as some
> > > > > kernel firmware objects.
> > > > > - If things have happened in the order I listed here, those will be the last
> > > > > allocations in the two user VM space heap object allocators, so those now
> > > > > get dropped, which drops the mappings of their backing GEM objects,
> > > > > unmapping them from the MMU VM page tables.
> > > > > - Those mappings will now be the last references to the actual MMU VM
> > > > > object, so that it gets destroyed (the lazy detach comes into effect here,
> > > > > PT base address is removed from the VM context base table, full ASID
> > > > > invalidate, etc.), which with it drops the IoPgTable that backs it, which
> > > > > frees the page tables.
> > > > > - Then finally the GEM objects backing the userspace allocators get dropped
> > > > > as well, which will be the last reference to them, so those get freed.
> > > >
> > > > Thanks for the write up!
> > > >
> > > > Maybe some more fundamental thoughts on this entire endeavour:
> > > >
> > > > I think one thing that's causing a lot of friction that in C drivers at
> > > > least some of these things are done with implied/borrowed references. If
> > > > you want an absolute shocking example, the gpu crash dump sampling for a
> > > > driver that does dynamic memory management can be pretty epic:
> > > >
> > > > - the timeout handler is guaranteed (well if the driver didn't screw up
> > > > things, which mostly boils down to call drm_sched_stop() before it
> > > > analyzes anything at all wrt gpu hw state) to hold up the drm_job
> > > > completion fence
> > > >
> > > > - this fence is (at submit ioctl time) installed into all the dma_resv
> > > > fence containers for the vm (single shared dma_resv for all vm-private obj for
> > > > efficiency, all shareable gem_bo have their own dma_resv)
> > > >
> > > > - the memory manager guarantees that it will not free (or even move in the
> > > > case of ttm) the buffer while a fence is unsingalled. ttm does this by
> > > > delaying the free as needed, i915-gem and some of the others did this by
> > > > trying to hold onto extra references. the latter causes way too much
> > > > problems with reference dropping in my experience looking at a decade of
> > > > drivers getting details wrong here.
> > > >
> > > > - which means the crash dump handler can have a plain list of gem_bo to
> > > > record, without any references, because they wont go away before it
> > > > finishes.
> > > >
> > > > - more fun even, you could directly sample the memory locations at ioctl
> > > > submit time, and even when ttm moves the buffers around in a pipelined
> > > > fashion: your timeout handler wont run before all the jobs to move the
> > > > buffer into the right location have completed, and it will not unblock
> > > > any subsequent move or eviction that's already queued up. Which means
> > > > even the raw memory address will point at the right stuff.
> > > >
> > > > This is just one example. drm and the kernel is full of these borrowed
> > > > reference tricks with often cross through the entire stack, so rust has to
> > > > be able to cope. The sales pitch of rust, and really the reason it's the
> > > > first non-C language in linux, is that with the borrow semantics, it can
> > > > validate this stuff at compile time, and allow kernel drivers to continue
> > > > playing these tricks to outright avoid any and all refcounting needs. If
> > > > rust doesn't deliver and we need to have all the refcounting again to make
> > > > rust drivers safe I think that'll cast serious doubts on the viability of
> > > > rust-in-linux.
> > >
> > > Right, so the thing is Rust can encode *some* borrowed reference semantics
> > > at compile time, namely what Rust lifetimes can represent. But you can't
> > > encode arbitrarily complex "I can guarantee this is alive right now because
> > > reasons A,B,C,D" semantics in the type system. For arbitrarily complex rules
> > > like that, you need to either refcount, or use unsafe and raw pointers. For
> > > example, in my driver I use raw pointers in the event slot acquisition to
> > > point to the actual u32s that hold the event values, because I can guarantee
> > > through the semantics of that module that those will always be valid
> > > pointers to the live array (and never alias too), even though I can't encode
> > > that in the type system. Throwing around refcounted pointers to the whole
> > > object there would be silly, so instead I just used unsafe and left a SAFETY
> > > comment explaining why it's safe.
> > >
> > > In the end, unsafe isn't any worse than C (it's strictly better, you still
> > > get the borrow checker and type safety and everything else), so I don't
> > > think Rust not being able to solve all of these problems is really a good
> > > argument against Rust. The idea, and where Rust really shines, is that even
> > > when you have to do this you are *containing* the unsafe code within places
> > > where it can be audited and explained. And when the majority of the code is
> > > safe, and when you strive to make the interfaces between modules safe, and
> > > when you limit the ability of cross-module interactions to break safety (Not
> > > eliminate! In abstractions yes, and in userspace Rust this is expected, but
> > > within a driver it gets more interesting and I can spend a long time talking
> > > about how safety boundaries in drivers are complicated and a bit porous...)
> > > you end up reducing the chances of safety bugs by orders of magnitude
> > > anyway.
> > >
> > > And it worked, right? I have ~128 unsafe{} blocks in the driver and still no
> > > oopses in production ^^
> > >
> > > But (big but!) the idea with safe Rust abstractions (the stuff in rust/) is
> > > that they absolutely do need to be safe (unless marked unsafe, but that
> > > sucks...), because then implicitly you are documenting the safety/lifetime
> > > requirements in the type system and that's hugely beneficial for common
> > > code. It means users of those APIs don't have to think about the internals
> > > and safety invariants and all that. Drivers still have to worry about
> > > internal safety, but they don't have to worry about messing up the interface
> > > with common code. That's also very beneficial for allowing refactoring of
> > > that common code without worrying that you'll break users.
> > >
> > > So technically we could just give up and mark the drm_sched abstraction
> > > unsafe and actually document all of these safety requirements (which is the
> > > other side of the coin - if you want to use unsafe, you get to write the
> > > docs, lawyer style! That's the rule! No getting by with vague docs with
> > > dozens of unexplained corner cases!). But, honestly, I think that would be
> > > kind of sad... drm_sched really should have a self-contained, safe API, and
> > > I just don't see a good reason why it can't other than it not having been
> > > designed with this in mind initially. It was ported out of a driver, and not
> > > that long ago, right? That explains a lot, because it means it is probably
> > > leaking all of these assumptions from that driver's internal design... and
> > > it probably doesn't help that the people maintaining it all seem to be from
> > > that particular company either. It's easy to have tunnel vision if you
> > > mostly work on one use case...
> >
> > Nah I don't want to give up. What I expect to happen is that rust
> > abstractions will need to be a lot more opionated about how to do certain
> > things. And then the abstractions together with the C code provide the
> > guarantee that the rust driver is safe.
> >
> > > > Now there's definitely going to be hilarious bugs uncovered on the C side,
> > > > and semantics that need to be clarified, but I think enabling scary tricks
> > > > like the above one is what'll fundamentally make or break rust in linux.
> > >
> > > For the specific case of crash dumps and stop-the-world stuff like that...
> > > yeah, those are some of the worst things to fit in otherwise nice Rust
> > > designs because they violate layering left and right. I don't have a good
> > > answer for how exactly I'd do this in my driver yet. I'm sure it *can* be
> > > done, and there's always unsafe if needed, since at that point you're
> > > dealing in driver code and that's a lot less contentious than an abstraction
> > > being unsafe itself...
> > >
> > > My initial plan for FW crash dumps is going to be fairly easy because I only
> > > really care about dumping the firmware object arena and maybe actual
> > > firmware .data, and that's all GPU-global anyway, and it won't be hard to
> > > poke my way into the global allocators to take a snapshot of their backing
> > > GEM objects. There's a global alloc lock involved there which is all I need
> > > for basic consistency and getting the locking right for allocs as far as
> > > deadlock risk is something I need to care about for other reasons anyway. It
> > > won't guarantee that nobody else is *writing* to any of that memory while I
> > > take a snapshot, but I don't really care about that since it would be rare
> > > and should be limited to newly-allocated memory the firmware doesn't yet
> > > care about anyway.
> >
> > So this is were crash dump capturing gets fun. If that lock covers any
> > memory allocations, you can't take it in the tdr path.
> >
> > The simplest way out is to make context crashes permanent and bail out the
> > entire crash capturing to a worker (which needs gem bo references) that
> > doesn't block anything (so not even guilty/victim reporting to userspace)
> > and can take locks and capture things at leasure.
>
> Yeah, I'm pretty sure we can just do that by doing things in the opposite
> order. Our FW crashes are permanent GPU crashes, so we can just signal first
> *then* take the snapshot right in the event RX thread. After all, the
> firmware just crashed, so that thread is never going to get another message
> again and is as good as any other. Signaling will cause a bunch of stuff to
> be freed, but as long as nothing allocs GPU memory on top (I can add a guard
> to fail allocs after a crash), it won't make a difference for non-debug
> dumps (in debug mode it will mark all freed objects as DEAD which will show
> up in the metadata blocks, but I can probably just add a special case to
> make it not do that after a crash, which is a good idea for hypervisor-based
> debugging too anyway).
>
> > Only works when userspace obeys the promise of not overwriting the memory
> > meanwhile, which is what you get for vk_device_lost.
>
> Userspace won't have access to firmware memory anyway, which is what I
> really care about. Honestly, I'm not sure we have much of a use case for
> full-GPU postmortem debugging like that. If the firmware crashed I 99% only
> care about the firmware heap + firmware .data (which I can snoop out of its
> RAM carveout), and maybe a trace log of recent driver events *and* firmware
> ktrace events (yes, the firmware has its own tracing!). If the GPU faulted
> the driver shouldn't even have to care much more than it already does. We
> already pass user fault info down into Mesa, and already print the likely
> culprit BO right in Mesa together with fault unit, type, etc. (it's cute!).
> I think I'd rather just add a strict fault mode that fully kills the context
> on fault (and ideally does something to prevent the firmware from even
> trying to execute any already queued commands after the crash one, TBD how
> to do this but I bet it's possible during the recovery cycle) and then Mesa
> finds out about the fault as soon as possible and we can implement GPU-side
> crash dumps right in userspace.
>
> We could even add stuff to the kernel to gather rich execution state info
> and pass it back to Mesa too, it's just going to take a ton of reverse
> engineering of the GPU registers to understand them since normally only
> firmware deals with that... but it's totally possible and I already have a
> huge list of "interesting" register addresses because the macOS driver has a
> big debug print function that gets called on GPU faults... which has all the
> printk()s compiled out in release builds, but it still reads all the
> registers anyway, so I see all the register reads in my hypervisor log! ^^
>
> >
> > > In the end, nothing stops you from having piles of raw pointer backdoors,
> > > having a rule that "this is only used for the crash handler", and then
> > > having a couple pages of SAFETY comments in that bit of the code with a big
> > > "here be dragons" disclaimer. It won't make that particular chunk of code
> > > any less likely to have bugs than the equivalent C code, but at least the
> > > rest of the driver isn't affected, and hopefully you can at least avoid
> > > unsafe in some subsets of the troublesome part too ^^
> > >
> > > There's also a flipside to this though: Who cares if you take extra
> > > redundant GEM BO references in the crash handler due to API safety? After
> > > all, it's not a perf-sensitive codepath. Of course, if safety causes you to
> > > add refcounting at all where none would otherwise be needed, or to have to
> > > do more of it in hot paths, or to introduce extra locking that could cause
> > > deadlocks that's a different story. But I think the thought process is very
> > > different between Rust and C here, when writing drivers. In C, it takes
> > > extra work to be safe, so chances are most people won't write defensive code
> > > like that even where there is no performance reason not to, since it's
> > > outright easier to read code the less safe it is. Rust isn't like that! Most
> > > safety is encoded in types (automatic reference counting is a big one here),
> > > so it doesn't take any extra effort to be safe, while it usually does to do
> > > it unsafely (unsafe {} blocks if nothing else). So then you write safe code
> > > by default... which yes, might do some extra refcounting and locking and
> > > stuff, but you can always optimize that later, profiler in hand. Once you
> > > know you need unsafe for real perf reasons, by all means, and I expect to
> > > end up doing a round of profiling in the future and have to poke some holes
> > > in things too. But until then... safe is just easier and, well, safer!
> >
> > The problem with extra refcounting is dropping them. There's endless
> > amounts of fun where dropping a kref at the wrong time goes boom (wrong
> > context, holding wrong locks). Or then you delay it and userspace gets
> > annoyed (task_work for struct file is fun), so just unconditionally
> > stuffing everything you free into a worker isn't always the right thing
> > either.
> >
> > And rust makes this extremely easy, all the Drop stuff isn't visible
> > anywhere in the code, and you'll never screw up the error paths. So my
> > take is a bit that unless you've excluded a borrow based design as
> > impossible already, sprinkling Arc<T> all over isn't really solid either.
>
> Yes, this is definitely a problem... This is where I really hope we get the
> execution context stuff some day, so this can also be checked at compile
> time. Fingers crossed!
>
> >
> > > > In the end it'll be lots of detail work, and I really hope it all works
> > > > out.
> > > >
> > > > > I probably got more than one thing wrong there, and there's layers of
> > > > > complexity I glossed over, but that's the rough idea ^^
> > > > >
> > > > > * If we need to fix this then we're going to need some kind of signal from
> > > > > the DRM core that this is happening and it's not normal user-triggered GEM
> > > > > closing, and it's going to be interesting... it also means we need some kind
> > > > > of mechanism to transfer responsibility over those mappings to all in-flight
> > > > > jobs themselves, because normally userspace is strictly responsible for all
> > > > > mappings in an explicit sync VM bind style world, and now we're adding a
> > > > > special case where we freeze the state of the VM until all in-flight jobs
> > > > > are done when the File goes away instead of eagerly freeing things. That's a
> > > > > very weird departure from how the whole thing normally works, so if we
> > > > > really want that I'm going to have to think of how to do it reasonably. It
> > > > > might be easier once we implement the full VM map range tracking which will
> > > > > likely flip the VM<->GEM object relationship around a bit.
> > > >
> > > > See my other reply for you how can untangle this potentially on the vm_id
> > > > subthread. I think in general rust Drop is awesome, but there's cases
> > > > where we'll have to be more explicit at unwinding things. At least in the
> > > > linux kernel the general consensus is that too much refcounting is a
> > > > maintenance disaster because you end up dropping the final reference in
> > > > all kinds of really bad places (interrupt handlers, while holding the
> > > > wrong locks, holding them too long).
> > >
> > > Yeah, that's one I've run into, it's why I have a few places with subtle
> > > code to avoid that and more things will need cleaning up too...
> > >
> > > This is something that I expect will become Rust compile-time magic in the
> > > future though! There's already been talk about being able to prove that
> > > things are only ever called from the right execution context. Then if you
> > > have a Drop of a reference whose underlying type is not safe to Drop from
> > > IRQ context in IRQ context, the compiler complains. It's not clear exactly
> > > what this will look like yet, but it's an active topic of discussion. And
> > > that's another nice thing about Rust: we can actually help drive the
> > > evolution of the language, much faster than with C!
> > >
> > > > Somewhat related I also expect that
> > > > rust drivers will need to have quite a few manual drop calls to
> > > > artificially limit lifetime, because sometimes when you drop things the
> > > > wrong way round (e.g. drop the refcount while still having the mutex guard
> > > > live) results in deadlocks.
> > >
> > > Yup! ^^
> > >
> > > drivers/gpu/drm/asahi$ grep -r core::mem::drop | wc -l
> > > 17
> > >
> > > Rust is magic but it won't magically solve all of our problems!
> > >
> > > (I don't think all of those 17 are necessary, some are just places where I
> > > wanted to be explicit/safer or bound critical sections more, but a few are
> > > definitely there to avoid deadlocks.)
> > >
> > > Simple Arc<Mutex<T>> usage isn't a problem though, you can't drop the
> > > refcount without dropping the guard first there (lifetimes prevent it). The
> > > tricky bit is when you have to call into *another* agent that owns a
> > > reference to the same object, after having done something while holding the
> > > mutex. Then you really need to drop the guard first.
> >
> > Yeah it's the cross object/module/crate fun for refcount dropping.
> >
> > The trouble is that refcount bugs are substantially worse to validate at
> > runtime than locking. Locking isn't optional, so generally all you need to
> > do is enable lockdep and run through all the codepaths and you're good.
> > Sometimes that's a bit too tough and you can insert fake lockdep contexts
> > to tie things together like fs_reclaim or dma_fence_signalling, but
> > by&large it's not that hard to validate at runtime to a reasonable
> > confidence.
> >
> > Reference dropping otoh isn't, because usually userspace holds onto a
> > reference and drops the final reference through a syscall in a very well
> > defined place. Now you could just add a fake lockdep context to pretend
> > that any kref_put could be the final one, but generally this results in a
> > lot of false positives (because of borrowed or implied references
> > protecting you from the final reference drop). Which means unless you've
> > gone through the trouble of proving that your final kref_put/rust Arc
> > drops are never in a bad place (which means pretty much uncovering the
> > entire Drop hierarchy mentally in every place) you have some potentially
> > really nasty problems at hand.
>
> Ah... honestly, I would actually want the strict version of that where any
> ref drop could be the last one. It's the only model that will work with
> future compile time checks anyway (unless you add explicit annotations to
> ignore some drops, maybe something like a nonfinal_drop() method that
> consumes a reference and instead asserts that it *isn't* the final one for
> these cases). It might be annoying to annotate those and get rid of the
> false positives, but annoying compilers and tooling is how we get safety in
> Rust after all ^^
>
> This isn't actually that bad, since for objects with this problem I *do*
> want to actually expose the Drop hierarchy. Usually when this happens I
> *think* I know where this stuff gets dropped, so it'd be nice if the
> compiler told me if I'm wrong. I'm actually considering implementing the
> undroppable trick in some places in the driver to bring these out,
> especially once I really start worrying about not screwing up fence
> signaling. It'd be trivial to have an ExplicitlyDropped<T> type (similar to
> ManuallyDrop<T>) that just refuses to be dropped automatically at
> compile/link time, but lets you do an .explicit_drop() instead.

Bit a wash-up reply, but if the overall consensus is that we have a path
towards robust refcount dropping, then that pretty much covers my worries.

There's a ton of things where auto-dropping is not problem at all (if it's
just a recursive Drop that does just kfree() and not much else). Or if the
contract for that object already implies that you can drop it from any
context, including hardirq handlers (like dma_fence, but especially older
drivers screw this up).

And then for the really nasty ones if we can make them explicit Drop only,
or require you have the right execution context for dropping or stuff like
that, then that's a substantial improvement over what we can do now. Even
if it needs some runtime checks still if it converts "kref_put for
everything" from "essentially not validateable" to "validateble with
compiler+runtime", then that's a really solid argument for a lot more
rust.

Like maybe Arc<T> needs some traits to encode T that are absolute safe
(because just kfree) and stuff that needs some serious care if you wrap
them in an Arc/kref.

> > And unlike rust in C all the _put() calls are at least explicitly staring
> > at you and reminding you that you're piling up a mountain that might come
> > crashing down in an avalanche :-) Ofc you're also all but guaranteed to
> > have screwed up an error path, but that's at least a runtime validation
> > solveable problem nowadays with kasan + code coverage checks (not that
> > we're any good at it tbh, syzkaller has clearly opinions to the contrary
> > here).
>
> ^^
>
> Hey, I haven't even started talking about the other safety things Rust can
> help with that I don't use yet! Stuff like asserting that struct types are
> safe to have in UAPIs and zero-initialize, and that we never leak stack
> contents and things like that... No matter where all this leads us, I think
> it's going to be good for the kernel ^^

Oh there's lots of good with rust. I'm kinda harping so much on this
because this feels like an area where if we're not careful, we might end
up converting a C problem that's bad, but tractable into a rust problem
that's much harder to sort out when things go wrong.

Ime for locking/refcount desing bugs the refactoring part isn't the tricky
one, it's being able to read all the drivers/code and fusing that into an
overall understanding, so that you can figure out what exactly you should
even do. I've had plenty of cases where the "read everything and think
real hard" part took years :-/ And my worry here on "just use more Arc"
was that it will make the reading everything part substantially tougher.

But I think there's tools to make sure we don't dig a complete hole with
these it sounds like. I guess another topic for pestering the rust folks.
-Daniel

>
> >
> > > > There's going to be lots of fun here :-)
> > >
> > > It's already been fun and I don't expect it to get any less fun ^^
> >
> > Oh that's for sure.
> >
> > >
> > > ~~ Lina
> > >
> >
>
> ~~ Lina
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 17:47:44

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On Thu, Apr 06, 2023 at 07:42:44PM +0200, Daniel Vetter wrote:
> But I think there's tools to make sure we don't dig a complete hole with
> these it sounds like. I guess another topic for pestering the rust folks.

Or to put it very bluntly: Could we make Arc<T> at least runtime enforce
(with the usual lockdep annotation trick like fs_reclaim) that and Drop is
the final one?

If that's the rust Arc<T> linux semantics then I think my worries are 100%
covered. And we'll sort out the trickier type based enforcement for
special cases when they hit us.

The downside is that this is substantially stricter than kref on the C
side, but I think that's a Good Thing :-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2023-04-06 22:18:15

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 2023-04-06 11:58, Lucas Stach wrote:
> Am Mittwoch, dem 05.04.2023 um 16:44 -0400 schrieb Luben Tuikov:
>> On 2023-04-05 13:44, Lucas Stach wrote:
>>> Hi Luben,
>>>
>>> Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>> Hi Danilo,
>>>>>
>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>> Hi all,
>>>>>>
>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>> writing it to the entity through which the job was deployed to the
>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>
>>>>>> Doing this can result in a race condition where the entity is already
>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>> once the job is fetched from the pending_list.
>>>>>>
>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>> over to the scheduler by this entity might still reside in the
>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>> a UAF.
>>>>>>
>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>
>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>> allocated for a subsequent client applications job structure again.
>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>> reproduce the same corruption over and over again by just using
>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>
>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>> a fix directly.
>>>>>>
>>>>>> Spontaneously, I see three options to fix it:
>>>>>>
>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>
>>>>> My vote is on this or something in similar vain for the long term. I
>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>> more fairness than the current one sometime in the future, which
>>>>> requires execution time tracking on the entities.
>>>>
>>>> Danilo,
>>>>
>>>> Using kref is preferable, i.e. option 1 above.
>>>>
>>>> Lucas, can you shed some light on,
>>>>
>>>> 1. In what way the current FIFO scheduling is unfair, and
>>>> 2. shed some details on this "scheduling algorithm with a bit
>>>> more fairness than the current one"?
>>>
>>> I don't have a specific implementation in mind yet. However the current
>>> FIFO algorithm can be very unfair if you have a sparse workload compete
>>> with one that generates a lot of jobs without any throttling aside from
>>> the entity queue length.
>>
>> Ah, that's interesting, let's see, a "sparse workload compete with one that
>> generates a lot of jobs", so basically we have a sparse workload compete
>> with a dense workload. So we can represent this with two entities, A, B,
>> whose jobs we're going to represent by the entities, names A and B.
>> So if we let A be the sparse workload and B the dense workload,
>> we have this, wlog,
>>
>> First/oldest job, .........................., latter/new jobs.
>> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, .....
>>
>> The current FIFO algorithm, would prefer to execute those jobs
>> in order of submission, i.e. oldest-ready-first job. Assume
>> that all jobs are ready. Then we'll execute them in order.
>> This is desirable and fair. We want to execute the jobs
>> in the order they were submitted, given also that they are
>> ready to be executed. So perhaps we want to execute them like this:
>>
>> First/oldest job, .........................., latter/new jobs.
>> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, ....
>> Exec: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>>
>> Any other ordering would starve either A, or B. If we executed the 2nd A
>> job at t6 or t7, then that would starve the 3rd/4th job in B, since the 2nd A job
>> arrives at the same time as that of the 3rd B job, at time t6.
>> The time t3-t0 is some delta > 0, some initial scheduler-start up time.
>>
> For simplicity now also assume that all jobs from A take 5ms of GPU
> time, while jobs from B take 50ms of GPU time.
>
>> IOW, we don't want to delay a job any more than it should wait--the oldest
>> job, which is also ready, should execute next, so that we're fair how
>> it executes in real time. We cannot boot B at t6, so that we execute A,
>> just because it is sparse, but just arrived.
>>
>> From A's point of view, it shouldn't expect its job execution time distribution
>> to be any different than its submission time distribution.
>>
>> Do you think there's a job permutation which offers a fairer scheduling
>> than the Exec line above for the Submission line above?
>>
> Yes, if we try to keep some fairness of actual GPU time made available
> to each entity by tracking the time spent over a sampling interval, we
> would see that at t6 B has already spent 100ms of GPU time, while A has
> only spent 5ms, so naturally we would prefer the newly submitted job
> from entity A when choosing the next job to schedule.

The problem is that you cannot inquire a priori about the actual
time the next task (job) would take. It might be the case
that the next A job would take a long time while the next B would take
very small amount of time.

>
>>> By tracking the actual GPU time consumed by
>>> the entities we could implement something with a bit more fairness like
>>> deficit round robin (don't pin me on the specific algorithm, as I
>>> haven't given it much thought yet).
>>
>> Since there's no preemption, this would be hard to achieve--you're at the mercy
>> of the execution time of job A_i for an entity A job i. (Assuming there's no
>> preemption as it is the current state of the GPU scheduler.)
>>
>> The only thing you can do, is punish the next job from this entity, A_i+1,
>> to execute much later. However, you don't know how long A_i+1 would take. If A_i+1
>> takes very little time, then you're better off executing it at the next opportune
>> time, i.e. when it would normally execute. But such an algorithm, which doesn't
>> know a priori the execution time of a job, would punish A_i+1 to execute much later.
>>
>> But if A_i+1 takes time as long or longer than A_i, then punishing it to execute much
>> later, would simply delay it, from an observer's point a view, it wouldn't teach
>> the context to submit smaller jobs, so that GPU sharing is more fair.
>
> Without preemption we can't really make the scheduling absolutely fair,
> but we can still do better than the simple FIFO. If a entity has taken

This is exactly my point as I stated above "Since there's no preemption..."

> up a disproportionate amount of GPU time over the last sampling
> interval, we can derate it when picking the next job to schedule to
> allow other entities to take up more GPU time in the next sampling
> interval.

The problem is that you don't know how long that new/other job would
take--it may be a long-time job as well. What if the punished context's
next job was a small quick job, but because of its previous job, it's now
been punished to wait a long time. What exacerbates the problem is if
the favourite new context picked has a big long job as the next job
because its previous job was a quick one.

The base problem with DRR, and similar scheduling algorithms when
applied to the GPU scheduling is that you don't know how long the next
job would take, a priori, and make decisions on context's past jobs. In
classical DRR applied to networking, you send bytes (the work done) until its
quanta is exhausted and then you move on to another one--which is preemption.

This approach would work if you get a hint from userspace as to the size
of jobs a context would send, i.e. knowing a priori. Then you can formulate
a context permutation which would achieve fairness, somewhat.

Are you seeing contexts sending lots of big jobs very frequently and other
contexts sending small jobs very infrequently?

What you could do, is use an adjusted-time-spent scheduling algorithm,
where you calculate the running average of a context's (past) jobs,
and schedule it with a frequency, such that the average time spent on the GPU
is about the same as that of other context's jobs.

Regards,
Luben

2023-04-06 22:48:15

[permalink] [raw]

Subject: Re: [Regression] drm/scheduler: track GPU active time per entity

On 2023-04-06 04:22, Christian König wrote:
> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
>> On 2023-04-05 10:05, Danilo Krummrich wrote:
>>> On 4/4/23 06:31, Luben Tuikov wrote:
>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>> Hi Danilo,
>>>>>
>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>> Hi all,
>>>>>>
>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>> writing it to the entity through which the job was deployed to the
>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>
>>>>>> Doing this can result in a race condition where the entity is already
>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>> once the job is fetched from the pending_list.
>>>>>>
>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>> over to the scheduler by this entity might still reside in the
>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>> a UAF.
>>>>>>
>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>
>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>> allocated for a subsequent client applications job structure again.
>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>> reproduce the same corruption over and over again by just using
>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>
>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>> a fix directly.
>>>>>>
>>>>>> Spontaneously, I see three options to fix it:
>>>>>>
>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>
>>>>> My vote is on this or something in similar vain for the long term. I
>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>> more fairness than the current one sometime in the future, which
>>>>> requires execution time tracking on the entities.
>>>> Danilo,
>>>>
>>>> Using kref is preferable, i.e. option 1 above.
>>> I think the only real motivation for doing that would be for generically
>>> tracking job statistics within the entity a job was deployed through. If
>>> we all agree on tracking job statistics this way I am happy to prepare a
>>> patch for this option and drop this one:
>>> https://lore.kernel.org/all/[email protected]/T/#u
>> Hmm, I never thought about "job statistics" when I preferred using kref above.
>> The reason kref is attractive is because one doesn't need to worry about
>> it--when the last user drops the kref, the release is called to do
>> housekeeping. If this never happens, we know that we have a bug to debug.
>
> Yeah, reference counting unfortunately have some traps as well. For
> example rarely dropping the last reference from interrupt context or
> with some unexpected locks help when the cleanup function doesn't expect
> that is a good recipe for problems as well.
>
>> Regarding the patch above--I did look around the code, and it seems safe,
>> as per your analysis, I didn't see any reference to entity after job submission,
>> but I'll comment on that thread as well for the record.
>
> Reference counting the entities was suggested before. The intentionally
> avoided that so far because the entity might be the tip of the iceberg
> of stuff you need to keep around.
>
> For example for command submission you also need the VM and when you
> keep the VM alive you also need to keep the file private alive....
>
> Additional to that we have some ugly inter dependencies between tearing
> down an application (potential with a KILL signal from the OOM killer)
> and backward compatibility for some applications which render something
> and quit before the rendering is completed in the hardware.

In my experience, ref counting has worked--just worked, even in
completely untested and unimagined circumstances. You pull
a cable (back end), or kill an app (front end) and everything
gracefully shuts down and gets freed.

(Behind that cable, you can have thousands of hardware entities,
(represented in the kernel), each processing thousands of work requests
(each represented in the kernel), but kref worked fine, very gracefully.)

Of course, what the dependency graph should be (what structs depends on
what structs), and what each struct should contain--that's what was spent
the most time, not coding, but really, it all worked like magic.

So, maybe, we should start anew...
--
Regards,
Luben

>
> Regards,
> Christian.
>
>>
>> Regards,
>> Luben
>>
>>> Christian mentioned amdgpu tried something similar to what Lucas tried
>>> running into similar trouble, backed off and implemented it in another
>>> way - a driver specific way I guess?
>>>
>>>> Lucas, can you shed some light on,
>>>>
>>>> 1. In what way the current FIFO scheduling is unfair, and
>>>> 2. shed some details on this "scheduling algorithm with a bit
>>>> more fairness than the current one"?
>>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>>>>> jobs deployed through this entity were fetched from the schedulers
>>>>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>>>>
>>>>>> 3. Just revert the change and let drivers implement tracking of GPU
>>>>>> active times themselves.
>>>>>>
>>>>> Given that we are already pretty late in the release cycle and etnaviv
>>>>> being the only driver so far making use of the scheduler elapsed time
>>>>> tracking I think the right short term solution is to either move the
>>>>> tracking into etnaviv or just revert the change for now. I'll have a
>>>>> look at this.
>>>>>
>>>>> Regards,
>>>>> Lucas
>>>>>
>>>>>> In the case of just reverting the change I'd propose to also set a jobs
>>>>>> entity pointer to NULL once the job was taken from the entity, such
>>>>>> that in case of a future issue we fail where the actual issue resides
>>>>>> and to make it more obvious that the field shouldn't be used anymore
>>>>>> after the job was taken from the entity.
>>>>>>
>>>>>> I'm happy to implement the solution we agree on. However, it might also
>>>>>> make sense to revert the change until we have a solution in place. I'm
>>>>>> also happy to send a revert with a proper description of the problem.
>>>>>> Please let me know what you think.
>>>>>>
>>>>>> - Danilo
>>>>>>
>

2023-04-06 22:57:16