Date: Fri, 14 Jul 2023 21:13:16 +0900
Subject: Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
From: Asahi Lina <lina@asahilina.net>
To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
 Sumit Semwal
Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
 asahi@lists.linux.dev
Message-ID: <7e53bc1f-7d1e-fb1c-be45-f03c5bbb8965@asahilina.net>
In-Reply-To: <003eb810-654e-3a2b-0756-d62440c2d419@amd.com>
References: <20230714-drm-sched-fixes-v1-0-c567249709f7@asahilina.net>
 <20230714-drm-sched-fixes-v1-2-c567249709f7@asahilina.net>
 <6b473196-9f87-d6c8-b289-18f80de78f0a@asahilina.net>
 <003eb810-654e-3a2b-0756-d62440c2d419@amd.com>

On 14/07/2023 19.18, Christian König wrote:
> On 14.07.23 at 12:06, Asahi Lina wrote:
>> On 14/07/2023 18.57, Christian König wrote:
>>> On 14.07.23 at 11:49, Asahi Lina wrote:
>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>> On 14.07.23 at 10:21, Asahi Lina wrote:
>>>>>> A signaled scheduler fence can outlive its scheduler, since
>>>>>> fences are independently reference counted. Therefore, we can't
>>>>>> reference the scheduler in the get_timeline_name()
>>>>>> implementation.
>>>>>>
>>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when
>>>>>> shared dma-bufs reference fences from GPU schedulers that no
>>>>>> longer exist.
>>>>>>
>>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>> ---
>>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>>  drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>>  include/drm/gpu_scheduler.h              | 5 +++++
>>>>>>  3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>>
>>>>>>  		/*
>>>>>>  		 * Fence is from the same scheduler, only need to wait for
>>>>>> -		 * it to be scheduled
>>>>>> +		 * it to be scheduled.
>>>>>> +		 *
>>>>>> +		 * Note: s_fence->sched could have been freed and reallocated
>>>>>> +		 * as another scheduler. This false positive case is okay,
>>>>>> +		 * as if the old scheduler was freed all of its jobs must
>>>>>> +		 * have signaled their completion fences.
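[Stepping out of the quoted patch for a moment, for anyone following
the thread from the sidelines: the failure mode is easy to see in
miniature. The sketch below is illustrative userspace C with made-up
types, not the kernel code; the actual fix lives in the sched_fence.c
and gpu_scheduler.h hunks not quoted above, but the spirit is the
same: the fence keeps its own copy of the name, so that
get_timeline_name() never has to dereference the scheduler.]

/* Miniature of the lifetime bug (illustrative, not drm_sched code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct scheduler {
	char name[16];
};

struct fence {
	struct scheduler *sched; /* dangles once the scheduler is freed */
	char sched_name[16];     /* copied while the scheduler was alive */
};

static const char *timeline_name_buggy(struct fence *f)
{
	return f->sched->name;   /* use-after-free once sched is gone */
}

static const char *timeline_name_fixed(struct fence *f)
{
	return f->sched_name;    /* safe: the fence owns this memory */
}

int main(void)
{
	struct scheduler *sched = malloc(sizeof(*sched));
	struct fence fence;

	strcpy(sched->name, "gpu-ring-0");
	fence.sched = sched;
	strcpy(fence.sched_name, sched->name); /* snapshot at init time */

	free(sched); /* scheduler torn down; refcounted fence lives on */

	/* timeline_name_buggy(&fence) would now read freed memory; the
	 * bufinfo debugfs file hits the kernel equivalent of that path
	 * for every fence attached to a shared dma-buf. */
	(void)timeline_name_buggy;
	printf("%s\n", timeline_name_fixed(&fence));
	return 0;
}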
>>>>>
>>>>> This is outright nonsense. As long as an entity for a scheduler
>>>>> exists it is not allowed to free up this scheduler.
>>>>>
>>>>> So this function can't be called like this.
>>>>
>>>> As I already explained, the fences can outlive their scheduler. That
>>>> means *this* entity certainly exists for *this* scheduler, but the
>>>> *dependency* fence might have come from a past scheduler that was
>>>> already destroyed along with all of its entities, and its address
>>>> reused.
>>>
>>> Well, this function is not about fences; it is a callback for the
>>> entity.
>>
>> That deals with dependency fences, which could have come from any
>> arbitrary source, including another entity and another scheduler.
>
> No, they can't. Signaling is mandatory before things are released,
> even if we allow decoupling the dma_fence from its issuer.

That's exactly what I'm saying in my comment: the fence must be
signaled if its creator no longer exists, so it's okay to inadvertently
wait on its scheduled fence instead of its finished fence (if that one
was intended), since everything must have signaled by that point
anyway.
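To make that concrete, here is a standalone sketch of the false
positive (again illustrative userspace C; whether the allocator
actually reuses the address is up to it, and the kernel types are
simplified away):

/* Premise: a scheduler may only be torn down once all of its fences
 * have signaled. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct scheduler { int id; };

struct sched_fence {
	struct scheduler *sched; /* may dangle after the scheduler dies */
	bool scheduled;          /* "job was picked up" fence */
	bool finished;           /* "job completed" fence */
};

static void scheduler_fini(struct scheduler *s, struct sched_fence *f)
{
	assert(f->scheduled && f->finished); /* everything signaled first */
	free(s);
}

int main(void)
{
	struct scheduler *old = malloc(sizeof(*old));
	struct sched_fence dep = { .sched = old };

	/* The old scheduler runs its job to completion, then goes away. */
	dep.scheduled = dep.finished = true;
	scheduler_fini(old, &dep);

	/* A new scheduler may reuse the freed allocation... */
	struct scheduler *fresh = malloc(sizeof(*fresh));

	/* ...so a same-scheduler test like s_fence->sched == sched can
	 * match a scheduler the fence never belonged to. The consequence
	 * is only that we wait on dep.scheduled instead of dep.finished,
	 * and both have already signaled, so nothing blocks incorrectly.
	 * (Comparing a dangling pointer is itself UB in userspace C; in
	 * the kernel it is just a stale value.) */
	if (dep.sched == fresh)
		printf("reused address: false positive, scheduled=%d\n",
		       dep.scheduled);
	else
		printf("allocator chose a different address this time\n");

	free(fresh);
	return 0;
}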
>>
>>>>
>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>> being told my comments are "outright nonsense" when you don't even
>>>> take the time to understand what the issue is and what I'm trying to
>>>> do/document. If you aren't interested in working with me, I'm just
>>>> going to give up on drm_sched, wait until Rust gets workqueue
>>>> support, and reimplement it in Rust. You can keep your broken fence
>>>> lifetime semantics and I'll do my own thing.
>>>
>>> I'm certainly trying to help here, but you seem to have unrealistic
>>> expectations.
>>
>> I don't think expecting not to be told my changes are "outright
>> nonsense" is an unrealistic expectation. If you think it is, maybe
>> that's yet another indicator of the culture problems the kernel
>> community has...
>
> Well, I'm just pointing out that you don't seem to understand the
> background here, and think this is a bug instead of intentional
> behavior.

I made a change, I explained why that change works with a portion of
the existing code by updating a comment, and you called that nonsense.
It's not even a bug: I'm trying to explain why this part isn't a bug
even under the expectation that fences don't outlive the scheduler. I
went through the code looking for problems this approach would cause,
ran into this tricky case, thought about it for a while, realized it
wasn't a problem, and figured it deserved a comment.

>>> I perfectly understand what you are trying to do, but you don't seem
>>> to understand that this functionality isn't made for your use case.
>>
>> I do, and that's why I'm trying to change things. Right now, this
>> functionality isn't even properly documented, which is why I thought
>> it could be used for my use case, and slowly discovered otherwise.
>> Daniel suggested documenting it, then fixing the mismatches between
>> documentation and reality, which is what I'm doing here.
>
> Well, I've known Daniel for something like 10-15 years; I'm pretty
> sure he meant that you should document the existing state, because
> anything else goes against the usual patch submission approach.
>
>>
>>> We can adjust the functionality to better match your requirements,
>>> but you can't say it is broken because it doesn't work when you use
>>> it in a way it isn't intended to be used.
>>
>> I'm saying the idea that a random dma-buf holds onto a chain of
>> references that prevents unloading the driver module that wrote into
>> it (and keeps a bunch of random unrelated objects alive) is a broken
>> state of affairs.
>
> Well no, this is intentional design. Otherwise the module, and with it
> the operations pointer the fences rely on, goes away.

But this is a drm_sched fence, not a driver fence. That's the point:
they should be decoupled. The driver would be free to unload, and only
drm_sched would need to stay loaded for its fences to remain valid.

Except that's not what happens right now. Right now the drm_sched fence
hangs onto the hw fence, and the whole chain is supposed to keep the
whole scheduler alive for things not to go boom.
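Spelled out as a compile-able sketch (illustrative names only, not the
real structures), the chain I'm describing is:

/* An exported dma-buf transitively pins everything its fence still
 * points into. */
#include <stdio.h>

struct driver_module { const char *name; };        /* the driver .ko */
struct hw_fence { struct driver_module *owner; };  /* driver dma_fence */
struct sched_fence { struct hw_fence *parent; };   /* drm_sched fence */
struct dma_buf { struct sched_fence *fence; };     /* shared buffer */

int main(void)
{
	struct driver_module mod = { "gpu_driver.ko" };
	struct hw_fence hwf = { &mod };
	struct sched_fence sf = { &hwf };
	struct dma_buf buf = { &sf };

	/* Today: while buf is alive, sf must stay valid; sf keeps hwf,
	 * and hwf keeps the module (and in practice the scheduler) from
	 * going away. Decoupling would mean sf drops `parent` once the
	 * hw fence signals, so only drm_sched itself has to outlive the
	 * exported fence. */
	printf("buf transitively pins %s\n",
	       buf.fence->parent->owner->name);
	return 0;
}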
> We already discussed that over 10 years ago, when Maarten came up
> with the initial dma_fence design.
>
> The resulting problems are very well known, and I completely agree
> that they are undesirable, but this is how the framework works, and
> not just the scheduler but the rest of the DMA-buf framework as well.

So it's undesirable, but you don't want me to change things...

>
>> It may or may not trickle down to actual problems for users (I would
>> bet it does in some cases, but I don't know for sure), but it's a
>> situation that doesn't make any sense.
>>
>> I know I'm triggering actual breakage with my new use case due to
>> this, which is why I'm trying to improve things. But the current
>> state of affairs just doesn't make any sense, even if it isn't
>> causing kernel oopses today with other drivers.
>>
>>> You can go ahead and try to re-implement the functionality in Rust,
>>> but then I would reject that, pointing out that this should probably
>>> be an extension to the existing code.
>>
>> You keep rejecting my attempts at extending the existing code...
>
> Well, I will try to improve here and push you in the right direction
> instead.

What is the right direction? So far it's looking more and more like the
answer is: wait until we get workqueue support in Rust, write a trivial
scheduler in the driver, and give up on this whole drm_sched thing.

Please tell me if there is a better way, because so far all you've done
is tell me my attempts are not the right way, which has demotivated me
from working on drm_sched at all.

~~ Lina