Received: by 2002:ac8:156:0:b0:3e0:cd10:60c8 with SMTP id f22csp820850qtg; Thu, 6 Apr 2023 02:15:58 -0700 (PDT) X-Google-Smtp-Source: AKy350YtHHwJFKNPbEyFGptLCXf3eq37qy4+XTZZTAAOUIf5H3S3XPFFF6mzGeyIAkIC5xaGRHo5 X-Received: by 2002:a17:902:da8f:b0:1a2:914e:2d2b with SMTP id j15-20020a170902da8f00b001a2914e2d2bmr11442003plx.19.1680772558034; Thu, 06 Apr 2023 02:15:58 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680772558; cv=none; d=google.com; s=arc-20160816; b=L4NjJKP9jWO3WLWrfEX3FORkismJhdCszlmZE5IcsFr7oYNrNeY+3ix0Gw8qWIiVMb GCdAYMv2ayqBg5jP2oe2p+PQgTelpUJtvO56ekjtIJ2vBLVpxiFsK4Bi+3Lw9yaS8Ve1 jcNhD1t5AvFwuytXxKtz4BHi/rK8cTT+x0V2GEkGQaL3LF46xak0Xc/RPY9MLQ443ORl d4Fpw7eU6WBkztjvKsv++4oEGOlWZVie+sJ+D1ja1Qqg0iQJp2WQ49XiRhbsMhr41MAZ hf++v7YOQv+8MHQaVwN/EFy9H6fJI3IEnJlt+mOC4BayqXnSszfHs80nzIZRgr3bFTu7 6S8A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=h+AqMHVb3g5m45wntdIM8uo4Oh/ZjswpzqhxjPXrTRQ=; b=JcXjBYwIJd9OKBRJ8OOt+mRswpOthqzdBP5mbqLPHN9+bcV3lMm/wFKTampirZMq8F SIqQBaVvxuU1aZq/2QdUz6bONIo8WtG0pauVg6YwaQu+3ddqwZf5sCQ7xEGU9wkoaodD bvZcSeorgID71TXH5XC2hY51gyBvvvxOiGFNTo6aT/SnDvN0yWPcpFpdmRbcsgjbJOYN EyLRNJSwzaNi5RxWTN4fwSQVEU3aHQWlWS/tPJCx2q3abtic0p2uEpVtFAxczJAq2s0S C+G6ggwNbezF6ScEE22Tr03v72DTV/ZGG17a/LEqfgzjPi8WVQk86n3Yfhb/K9EZRHZ1 QRfg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@asahilina.net header.s=default header.b=Pk9vO6rD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=asahilina.net Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b19-20020a170902d41300b0019911aa50e7si1136934ple.368.2023.04.06.02.15.47; Thu, 06 Apr 2023 02:15:58 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@asahilina.net header.s=default header.b=Pk9vO6rD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=asahilina.net Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236210AbjDFJPN (ORCPT + 99 others); Thu, 6 Apr 2023 05:15:13 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56100 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235893AbjDFJPL (ORCPT ); Thu, 6 Apr 2023 05:15:11 -0400 Received: from mail.marcansoft.com (marcansoft.com [212.63.210.85]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A95F6171E; Thu, 6 Apr 2023 02:15:09 -0700 (PDT) Received: from [127.0.0.1] (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) (Authenticated sender: lina@asahilina.net) by mail.marcansoft.com (Postfix) with ESMTPSA id C05E542494; Thu, 6 Apr 2023 09:15:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=asahilina.net; s=default; t=1680772508; bh=r6QAApmFMXAdCdgmAyhMS45Oy+VssbcgRzqJ53SBDRI=; h=Date:Subject:To:Cc:References:From:In-Reply-To; b=Pk9vO6rDodk/l13WtBwwjNtIsGaY8mnjvwnU3vGDKmwgiTOL35AsH/oVV/FUuho9B +xj12USiFoOD6wXq4SeANC/+55xdP4zmXmUc8BZU1hcTYysfL4Ah3ISTEPV+7ewoT2 tJQmj34PAWig7LmmsYoWlKLhcCqRRbRyK/Ejn21MnIZKxd2FgggeKYW+KhtfoZzhTb G7ZQEEfJ3KBUowfebMGuZ5pApei7w3FgdSPMtCkEa2ANtXaqUAgXlXDw5Q6avSgNq6 051fbK+MwRVhqVpwDwlpSdB8uCg4rOFWQgpiBo1Z0yVfVXx/TbXitcDnVyfjIzQd1D r181/CGXWNleA== Message-ID: Date: Thu, 6 Apr 2023 18:15:04 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Subject: Re: [PATCH] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name Content-Language: en-US To: =?UTF-8?Q?Christian_K=c3=b6nig?= , Luben Tuikov , David Airlie , Daniel Vetter , Sumit Semwal Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, asahi@lists.linux.dev References: <20230406-scheduler-uaf-1-v1-1-8e5662269d25@asahilina.net> <6b3433ee-0712-f789-51ee-3047ead9bb79@amd.com> <180bd178-e3c0-85e3-785e-fc8a216cf65e@amd.com> From: Asahi Lina In-Reply-To: <180bd178-e3c0-85e3-785e-fc8a216cf65e@amd.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.4 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/04/2023 18.06, Christian König wrote: > Am 06.04.23 um 10:49 schrieb Asahi Lina: >> On 06/04/2023 17.29, Christian König wrote: >>> Am 05.04.23 um 18:34 schrieb Asahi Lina: >>>> A signaled scheduler fence can outlive its scheduler, since fences are >>>> independently reference counted. >>> >>> Well that is actually not correct. Schedulers are supposed to stay >>> around until the hw they have been driving is no longer present. >> >> But the fences can outlive that. You can GPU render into an imported >> buffer, which attaches a fence to it. Then the GPU goes away but the >> fence is still attached to the buffer. Then you oops when you cat that >> debugfs file... > > No, exactly that's the point you wouldn't ops. > >> >> My use case does this way more often (since schedulers are tied to >> UAPI objects), which is how I found this, but as far as I can tell >> this is already broken for all drivers on unplug/unbind/anything else >> that would destroy the schedulers with fences potentially referenced >> on separate scanout devices or at any other DMA-BUF consumer. > > Even if a GPU is hot plugged the data structures for it should only go > away with the last reference, since the scheduler fence is referencing > the hw fence and the hw fence in turn is referencing the driver this > shouldn't happen. So those fences can potentially keep half the driver data structures alive just for existing in a signaled state? That doesn't seem sensible (and it definitely doesn't work for our use case where schedulers can be created and destroyed freely, it could lead to way too much junk sticking around in memory). >> >>> E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. >> >> It's up to drivers not to mess that up, since the HW fence has the >> same requirements that it can outlive other driver objects, just like >> any other fence. That's not something the scheduler has to be >> concerned with, it's a driver correctness issue. >> >> Of course, in C you have to get it right yourself, while with correct >> Rust abstractions will cause your code to fail to compile if you do it >> wrong ^^ >> >> In my particular case, the hw_fence is a very dumb object that has no >> references to anything, only an ID and a pending op count. Jobs hold >> references to it and decrement it until it signals, not the other way >> around. So that object can live forever regardless of whether the rest >> of the device is gone. > > That is then certainly a bug. This won't work that way, and the timelime > name is just the tip of the iceberg here. > > The fence reference count needs to keep both the scheduler and driver > alive. Otherwise you could for example unload the module and immediately > ops because your fence_ops go away. Yes, I realized the module issue after writing the email. But that's the *only* thing it needs to hold a reference to! Which is much more sensible than keeping a huge chunk of UAPI-adjacent data structures alive for a signaled fence that for all we know might stick around forever attached to some buffer. >>> Your use case is now completely different to that and this won't work >>> any more. >>> >>> This here might just be the first case where that breaks. >> >> This bug already exists, it's just a lot rarer for existing use >> cases... but either way Xe is doing the same thing I am, so I'm not >> the only one here either. > > No it doesn't. You just have implemented the references differently than > they are supposed to be. > > Fixing this one occasion here would mitigate that immediate ops, but > doesn't fix the fundamental problem. Honestly, at this point I'm starting to doubt whether we want to use drm_scheduler at all, because it clearly wasn't designed for our use case and every time I try to fix something your answer has been "you're using it wrong", and at the same time the practically nonexistent documentation makes it impossible to know how it was actually designed to be used. ~~ Lina