Message-ID: <849577a8-feab-d876-c0fa-4c752ed37b78@amd.com>
Date: Wed, 19 Jul 2023 11:05:38 -0400
Subject: Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
From: Luben Tuikov <luben.tuikov@amd.com>
To: Christian König, Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal
Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, asahi@lists.linux.dev
In-Reply-To: <66943763-0c30-9603-2a4c-7603ab5ece27@amd.com>
References: <20230714-drm-sched-fixes-v1-0-c567249709f7@asahilina.net>
 <20230714-drm-sched-fixes-v1-3-c567249709f7@asahilina.net>
 <9343165f-2533-f363-4942-719ee9b7d2a4@amd.com>
 <99f9003f-d959-fff3-361a-25b2f47efc88@asahilina.net>
 <66943763-0c30-9603-2a4c-7603ab5ece27@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023-07-19 04:45, Christian König wrote:
> Am 16.07.23 um 09:51 schrieb Asahi Lina:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free
>> to do whatever it wants with those jobs. A driver could opt to
>> synchronously kill those jobs (if it can) or account for them
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with
>> pending jobs, it immediately considers them terminated with an error,
>> and returns ownership back to the driver for freeing. Then the driver
>> can decide how to handle the rest and whatever the underlying hardware
>> state is.
>
> Yeah, and exactly that is absolutely *not* a good idea. Keep in mind
> that memory management depends on all this stuff, and signaling a
> dma_fence always requires that the hw give a go for that.
>
> If you want to clean up a scheduler with pending jobs, what needs to
> happen instead is that the driver cancels the processing and signals the
> hw fence.
>
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make
>> it safe, but it shouldn't regress any existing drivers since if they
>> trigger this code path they are broken today.
>
> Yes, and exactly that's intentional.
>
> What you can do is to issue a *big* warning here when there are still
> pending unsignaled hw fences when the driver calls drm_sched_fini().
>
> But setting the scheduler fence to signaled without getting a signaled
> state from the hw fence is a complete NO-GO.

Okay, so we have the requirement (how). If we can also get a reason behind
it (why), perhaps we can add the requirement and the reason as a lucid
comment to drm_sched_fini(), to come with this patch when reworked, so that
future drivers, whether they be in Rust or C, can take note.

Perhaps this will also help future development in DRM itself.
--
Regards,
Luben

>
> Regards,
> Christian.
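
[Editor's sketch, not part of the patch or thread: one way the "big warning"
Christian asks for, together with the lucid comment Luben proposes, could look
as a helper that drm_sched_fini() calls first. It only assumes the
pending_list, job_list_lock and s_fence fields that the diff quoted below
already uses; the helper name is invented.]

    #include <linux/dma-fence.h>
    #include <drm/gpu_scheduler.h>

    /*
     * Memory management trusts dma_fences: a signaled scheduler fence tells
     * the rest of the kernel that the hardware is done with the job's
     * buffers.  The scheduler must therefore never signal its finished fence
     * unless the hw fence actually signaled.  If a driver tears the
     * scheduler down while hw fences are still outstanding, warn loudly
     * instead of silently faking completions.
     */
    static void drm_sched_warn_on_unsignaled_jobs(struct drm_gpu_scheduler *sched)
    {
            struct drm_sched_job *s_job;

            spin_lock(&sched->job_list_lock);
            list_for_each_entry(s_job, &sched->pending_list, list) {
                    /* s_fence->parent is the hw fence the scheduler waits on. */
                    if (s_job->s_fence->parent &&
                        !dma_fence_is_signaled(s_job->s_fence->parent))
                            WARN(1, "drm_sched_fini() called with an unsignaled hw fence still pending\n");
            }
            spin_unlock(&sched->job_list_lock);
    }

[Calling something like this, or open-coding the check, at the top of
drm_sched_fini() would at least make the broken-driver case loud, which is
what the NAK above is asking for.]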
>
>>
>>>
>>>> Signed-off-by: Asahi Lina
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>   {
>>>>       struct drm_sched_entity *s_entity;
>>>> +    struct drm_sched_job *s_job, *tmp;
>>>>       int i;
>>>>
>>>> -    if (sched->thread)
>>>> -        kthread_stop(sched->thread);
>>>> +    if (!sched->thread)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>> +     * and cleaning up complete jobs.
>>>> +     */
>>>> +    drm_sched_stop(sched, NULL);
>>>> +
>>>> +    /*
>>>> +     * Iterate through the pending job list and free all jobs.
>>>> +     * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>> +     * otherwise it is responsible for keeping any necessary data structures for
>>>> +     * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>> +     * putting them in its own queue or doing its own refcounting).
>>>> +     */
>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +        spin_lock(&sched->job_list_lock);
>>>> +        list_del_init(&s_job->list);
>>>> +        spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +        dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +        drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
>>
>>>
>>>> +
>>>> +        WARN_ON(s_job->s_fence->parent);
>>>> +        sched->ops->free_job(s_job);
>>>> +    }
>>>> +
>>>> +    kthread_stop(sched->thread);
>>>>
>>>>       for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>> but technically, we want the driver to detect GPU removal and return shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my
>> use case (like Xe's) involves using a separate drm_sched instance *per
>> file* since these queues are scheduled directly by the firmware. So
>> this isn't about GPU removal, but rather about a GPU context going
>> away while jobs are in flight (e.g. the process got killed). We want
>> that to quickly kill the "DRM view" of the world, including signaling
>> all the fences with an error and freeing resources like the scheduler
>> itself.
>>
>> In the case of this particular GPU, there is no known way to actively
>> and synchronously abort GPU jobs, so we need to let them run to
>> completion (or failure), but we don't want that to block process
>> cleanup and freeing a bunch of high-level resources. The driver is
>> architected roughly along the lines of a firmware abstraction layer
>> that maps to the firmware shared memory structures, and then a layer
>> on top that implements the DRM view.
>> When a process gets killed, the DRM side (which includes the scheduler,
>> etc.) gets torn down immediately, and it makes sense to handle this
>> cleanup inside drm_sched since it already has a view into what jobs are
>> in flight. Otherwise, I would have to duplicate job tracking in the
>> driver (actually worse: in the Rust abstraction for safety), which
>> doesn't make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware
>> structures. So when the drm_sched gets torn down and all the jobs
>> killed, the underlying firmware jobs do run to completion, and the
>> resources they use are all cleaned up after that (it's all reference
>> counted). The primitive involved here is that in-flight firmware jobs
>> are assigned an event completion slot, and that keeps a reference to
>> them from a global array until the events fire and all the jobs are
>> known to have completed. This keeps things memory-safe, since we
>> absolutely cannot free/destroy firmware structures while they are in
>> use (otherwise the firmware crashes, which is fatal on these GPUs -
>> requires a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening
>> when a process gets killed is that all the user objects for in-flight
>> jobs get unmapped, which usually causes the GPU hardware (not
>> firmware) to fault. This then triggers early termination of jobs
>> anyway via the firmware fault recovery flow. But even that takes some
>> short amount of time, and by then all the drm_sched stuff is long gone
>> and we're just dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear, we should notify the
>>> driver that we're about to forget jobs (resources), so that it knows
>>> to return them back or that it shouldn't notify us for them (since
>>> we've notified we're forgetting them.)
>>
>> That contradicts Christian's comment. I tried to document that (after
>> this patch) the scheduler no longer cares about hw fences and whether
>> they are signaled or not after it's destroyed, and I got a strongly
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>> signal the hw fence after a scheduler teardown, or not?
>>
>> But really, I don't see a use case for an explicit "about to forget
>> job" callback. The job free callback already serves the purpose of
>> telling the driver to clean up resources associated with a job. If it
>> wants to synchronously abort things there, it could easily take over
>> its own fence signaling and do something with the underlying stuff if
>> the fence is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just
>> maps to the destructor (Drop impl), that just ends up freeing a bunch
>> of memory and other objects, and I don't particularly care about the
>> state of the firmware side any more after that. The flow is the same
>> whether it was a successful job completion, a failure, or an early
>> destruction due to the drm_sched getting torn down.
>>
>>> (Note also that in this latter case, traditionally, the device would
>>> be reset, so that we can guarantee that it has forgotten all shared
>>> resources which we are to tear up. This is somewhat more complicated
>>> with GPUs, thus the method pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless
>> the firmware has an explicit teardown/forget op (which I'm not aware
>> of) and a full GPU reset isn't something we can do either. Hence we
>> just let the underlying jobs complete. In practice they tend to die
>> pretty quickly anyway once all the buffers are unmapped.
>>
>> ~~ Lina
>>
>
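
[Editor's sketch of the driver-side ordering Christian describes: cancel the
processing, signal the hw fences with an error, and only then tear the
scheduler down. Everything prefixed with my_ is a hypothetical stand-in for
the driver's own bookkeeping; the only real APIs used are
dma_fence_set_error(), dma_fence_signal() and drm_sched_fini().]

    #include <linux/dma-fence.h>
    #include <linux/errno.h>
    #include <linux/list.h>
    #include <drm/gpu_scheduler.h>

    /* Hypothetical driver-side structures, for illustration only. */
    struct my_job {
            struct list_head node;
            struct dma_fence *hw_fence;     /* fence returned from ->run_job() */
    };

    struct my_queue {
            struct drm_gpu_scheduler sched;
            struct list_head in_flight;     /* driver's own tracking of hw jobs */
    };

    /* Driver-specific: stop the hw/firmware from processing this queue. */
    static void my_hw_cancel_processing(struct my_queue *q) { }

    static void my_queue_teardown(struct my_queue *q)
    {
            struct my_job *job;

            /* 1. Cancel the processing, as far as the hardware allows. */
            my_hw_cancel_processing(q);

            /* 2. Complete every outstanding hw fence, with an error. */
            list_for_each_entry(job, &q->in_flight, node) {
                    dma_fence_set_error(job->hw_fence, -ECANCELED);
                    dma_fence_signal(job->hw_fence);
            }

            /*
             * 3. Only now tear the scheduler down: nothing is pending, so
             * drm_sched never has to fake a completion on the driver's behalf.
             */
            drm_sched_fini(&q->sched);
    }

[Whether step 1 is possible at all is exactly the point of contention above:
for firmware-scheduled queues with no abort op, step 2 can only happen once
the firmware jobs actually complete.]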