From: Daniel Vetter
Date: Thu, 6 Apr 2023 10:27:04 +0200
Subject: Re: [Regression] drm/scheduler: track GPU active time per entity
To: Christian König
Cc: Luben Tuikov, Danilo Krummrich, Lucas Stach, Dave Airlie,
 Bagas Sanjaya, andrey.grodzovsky@amd.com, tvrtko.ursulin@linux.intel.com,
 Matthew Brost, yuq825@gmail.com, Boris Brezillon, lina@asahilina.net,
 dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org

On Thu, 6 Apr 2023 at 10:22, Christian König wrote:
>
> On 05.04.23 at 18:09, Luben Tuikov wrote:
> > On 2023-04-05 10:05, Danilo Krummrich wrote:
> >> On 4/4/23 06:31, Luben Tuikov wrote:
> >>> On 2023-03-28 04:54, Lucas Stach wrote:
> >>>> Hi Danilo,
> >>>>
> >>>> On Tuesday, 2023-03-28 at 02:57 +0200, Danilo Krummrich wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> >>>>> tries to track the accumulated time that a job was active on the GPU,
> >>>>> writing it to the entity through which the job was originally deployed
> >>>>> to the scheduler. This is done within drm_sched_get_cleanup_job(),
> >>>>> which fetches a job from the scheduler's pending_list.
> >>>>>
> >>>>> Doing this can result in a race condition where the entity is already
> >>>>> freed, but the entity's newly added elapsed_ns field is still accessed
> >>>>> once the job is fetched from the pending_list.
> >>>>>
> >>>>> Once drm_sched_entity_destroy() has been called it should be safe to
> >>>>> free the structure that embeds the entity. However, a job originally
> >>>>> handed over to the scheduler by this entity might still reside in the
> >>>>> scheduler's pending_list for cleanup after drm_sched_entity_destroy()
> >>>>> has already been called and the entity has been freed. Hence, we can
> >>>>> run into a UAF.
> >>>>>
> >>>> Sorry about that, I clearly didn't properly consider this case.
> >>>>
> >>>>> In my case it happened that a job, as explained above, was just picked
> >>>>> from the scheduler's pending_list after the entity was freed because
> >>>>> the client application exited. Meanwhile this freed-up memory had
> >>>>> already been allocated again for a subsequent client application's job
> >>>>> structure. Hence, the new job's memory got corrupted. Luckily, I was
> >>>>> able to reproduce the same corruption over and over again by just using
> >>>>> deqp-runner to run a specific set of VK test cases in parallel.
> >>>>>
> >>>>> Fixing this issue doesn't seem to be very straightforward though
> >>>>> (unless I'm missing something), which is why I'm writing this mail
> >>>>> instead of sending a fix directly.
> >>>>>
> >>>>> Spontaneously, I see three options to fix it:
> >>>>>
> >>>>> 1. Rather than embedding the entity into driver-specific structures
> >>>>> (e.g. tied to file_priv) we could allocate the entity separately and
> >>>>> reference-count it, such that it's only freed once all jobs that were
> >>>>> deployed through this entity have been fetched from the scheduler's
> >>>>> pending list.
> >>>>>
> >>>> My vote is on this or something in a similar vein for the long term. I
> >>>> have some hope of being able to add a GPU scheduling algorithm with a
> >>>> bit more fairness than the current one sometime in the future, which
> >>>> requires execution time tracking on the entities.
> >>> Danilo,
> >>>
> >>> Using kref is preferable, i.e. option 1 above.
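
For the record, a rough sketch of what option 1 could look like (all
struct and function names below are made up for illustration; none of
this is existing drm_sched API): the entity lives in its own
allocation, every in-flight job holds a reference, and the kref
release does the actual freeing.

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/ktime.h>
#include <linux/slab.h>

/* Hypothetical stand-in for a separately allocated scheduler entity. */
struct my_entity {
        struct kref refcount;   /* keeps the entity alive for pending jobs */
        u64 elapsed_ns;         /* accumulated GPU active time */
};

struct my_job {
        struct my_entity *entity;
        ktime_t start, end;     /* set by the backend around execution */
};

static struct my_entity *my_entity_create(void)
{
        struct my_entity *entity = kzalloc(sizeof(*entity), GFP_KERNEL);

        if (entity)
                kref_init(&entity->refcount);   /* the driver's reference */
        return entity;
}

static void my_entity_release(struct kref *kref)
{
        struct my_entity *entity =
                container_of(kref, struct my_entity, refcount);

        kfree(entity);
}

/* At submission the job takes its own reference on the entity. */
static void my_job_attach(struct my_job *job, struct my_entity *entity)
{
        kref_get(&entity->refcount);
        job->entity = entity;
}

/*
 * Job cleanup, i.e. the spot where drm_sched_get_cleanup_job()
 * currently writes elapsed_ns: account first, then drop the reference,
 * so the entity cannot be freed while we still touch it.
 */
static void my_job_cleanup(struct my_job *job)
{
        job->entity->elapsed_ns +=
                ktime_to_ns(ktime_sub(job->end, job->start));
        kref_put(&job->entity->refcount, my_entity_release);
}

The point is simply that the accounting happens before the kref_put(),
so a job can never touch a freed entity.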
> >> I think the only real motivation for doing that would be for generically
> >> tracking job statistics within the entity a job was deployed through. If
> >> we all agree on tracking job statistics this way I am happy to prepare a
> >> patch for this option and drop this one:
> >> https://lore.kernel.org/all/20230331000622.4156-1-dakr@redhat.com/T/#u
> > Hmm, I never thought about "job statistics" when I preferred using kref
> > above. The reason kref is attractive is because one doesn't need to worry
> > about it--when the last user drops the kref, the release is called to do
> > housekeeping. If this never happens, we know that we have a bug to debug.
>
> Yeah, reference counting unfortunately has some traps as well. For
> example, rarely dropping the last reference from interrupt context, or
> with some unexpected locks held when the cleanup function doesn't expect
> that, is a good recipe for problems as well.
>
> > Regarding the patch above--I did look around the code, and it seems safe,
> > as per your analysis; I didn't see any reference to the entity after job
> > submission, but I'll comment on that thread as well for the record.
>
> Reference counting the entities was suggested before. We intentionally
> avoided that so far because the entity might be the tip of the iceberg
> of stuff you need to keep around.
>
> For example, for command submission you also need the VM, and when you
> keep the VM alive you also need to keep the file private alive....

Yeah, refcounting often looks like the easy way out to avoid
use-after-free issues, until you realize you've just made lifetimes
unbounded and have some enormous leaks: the entity keeps the VM alive,
the VM keeps all the BOs alive, and somehow every crash wastes more
memory, because vk_device_lost means userspace allocates new stuff for
everything.

If possible, a lifetime design where lifetimes have hard bounds and you
just borrow a reference under a lock (or some other ownership rule) is
generally much more solid. But also much harder to design correctly :-/
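
Something in that direction could look roughly like this (again, all
names are made up and this is completely untested): the job's entity
pointer is only ever dereferenced under the scheduler's job-list lock,
and entity teardown clears that pointer in all still-pending jobs under
the same lock, so the borrowed reference can never outlive the entity.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct my_entity {
        u64 elapsed_ns;         /* as in the sketch above, minus the kref */
};

struct my_job {
        struct list_head node;          /* on sched->pending_list */
        struct my_entity *entity;       /* borrowed, may be cleared */
};

struct my_sched {
        spinlock_t job_list_lock;
        struct list_head pending_list;
};

/* Entity teardown: detach the entity from every still-pending job. */
static void my_entity_teardown(struct my_sched *sched,
                               struct my_entity *entity)
{
        struct my_job *job;

        spin_lock(&sched->job_list_lock);
        list_for_each_entry(job, &sched->pending_list, node) {
                if (job->entity == entity)
                        job->entity = NULL;
        }
        spin_unlock(&sched->job_list_lock);

        /* Only from here on may the driver free the entity. */
}

/* Job cleanup: borrow the entity under the lock, and only if it is
 * still there. */
static void my_job_account(struct my_sched *sched, struct my_job *job,
                           u64 runtime_ns)
{
        spin_lock(&sched->job_list_lock);
        if (job->entity)
                job->entity->elapsed_ns += runtime_ns;
        spin_unlock(&sched->job_list_lock);
}

Which is basically Danilo's "set the job's entity pointer to NULL"
idea further down, just with a lock around it so the clearing and the
accounting can't race.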

> In addition to that we have some ugly interdependencies between tearing
> down an application (potentially with a KILL signal from the OOM killer)
> and backward compatibility for some applications which render something
> and quit before the rendering is completed by the hardware.

Yeah, I think that part would also be good to sort out once & for all
in drm/sched, because i915 has/had the same struggle.
-Daniel

> Regards,
> Christian.
>
> > Regards,
> > Luben
> >
> >> Christian mentioned amdgpu tried something similar to what Lucas tried,
> >> ran into similar trouble, backed off and implemented it in another
> >> way - a driver-specific way, I guess?
> >>
> >>> Lucas, can you shed some light on,
> >>>
> >>> 1. in what way the current FIFO scheduling is unfair, and
> >>> 2. some details on this "scheduling algorithm with a bit
> >>> more fairness than the current one"?
> >>>
> >>> Regards,
> >>> Luben
> >>>
> >>>>> 2. Somehow make sure drm_sched_entity_destroy() blocks until all
> >>>>> jobs deployed through this entity have been fetched from the
> >>>>> scheduler's pending list. Though, I'm pretty sure that this is not
> >>>>> really desirable.
> >>>>>
> >>>>> 3. Just revert the change and let drivers implement tracking of GPU
> >>>>> active times themselves.
> >>>>>
> >>>> Given that we are already pretty late in the release cycle and etnaviv
> >>>> is the only driver so far making use of the scheduler's elapsed-time
> >>>> tracking, I think the right short-term solution is to either move the
> >>>> tracking into etnaviv or just revert the change for now. I'll have a
> >>>> look at this.
> >>>>
> >>>> Regards,
> >>>> Lucas
> >>>>
> >>>>> In the case of just reverting the change I'd propose to also set a
> >>>>> job's entity pointer to NULL once the job has been taken from the
> >>>>> entity, such that in case of a future issue we fail where the actual
> >>>>> issue resides, and to make it more obvious that the field shouldn't
> >>>>> be used anymore after the job has been taken from the entity.
> >>>>>
> >>>>> I'm happy to implement the solution we agree on. However, it might
> >>>>> also make sense to revert the change until we have a solution in
> >>>>> place. I'm also happy to send a revert with a proper description of
> >>>>> the problem. Please let me know what you think.
> >>>>>
> >>>>> - Danilo
> >>>>>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch