From: Daniel Vetter
Date: Fri, 7 Oct 2022 11:28:12 +0200
Subject: Re: [git pull] drm for 6.1-rc1
To: Linus Torvalds, Andrey Grodzovsky
Cc: Dave Airlie, Alex Deucher, Alex Deucher, Christian König, LKML, dri-devel
X-Mailing-List: linux-kernel@vger.kernel.org

Forgot to add Andrey as scheduler maintainer.
-Daniel

On Fri, 7 Oct 2022 at 10:16, Daniel Vetter wrote:
>
> On Fri, 7 Oct 2022 at 01:45, Linus Torvalds wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> >         struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL.
> > The code is
> >
> >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> >    5: 41 54                 push   %r12
> >    7: 55                    push   %rbp
> >    8: 53                    push   %rbx
> >    9: 48 89 fb              mov    %rdi,%rbx
> >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
> >   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
> >   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> >         atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> >         if (kref_read(&job->s_fence->finished.refcount)) {
> >                 /* drm_sched_job_arm() has been called */
> >                 dma_fence_put(&job->s_fence->finished);
> >         ...
> >
> > but then it does
> >
> >         job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
>
> Just figured I'll clarify this, because it's indeed a bit wtf and the
> comment doesn't explain much. drm_sched_job_cleanup can be called both
> when a real job is being cleaned up (which holds a full reference on
> job->s_fence and needs to drop it) and to simplify error path in job
> constructions (and the "is this refcount initialized already" signals
> what exactly needs to be cleaned up or not). So no race, because the
> only times this check goes different is when job construction has
> failed before the job struct is visible by any other thread.
>
> But yeah the comment could actually explain what's going on here :-)
>
> And yeah the patch Dave reverted screws up the cascade of references
> that ensures this all stays alive until drm_sched_job_cleanup is
> called on active jobs, so looks all reasonable to me. Some Kunit tests
> maybe to exercise these corners? Not the first time pure scheduler
> code blew up, so probably worth the effort.
> -Daniel
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
> >
> >                  Linus
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
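
For anyone reading along in the archive, here is a rough userspace sketch of the "is this refcount initialized already" idea discussed in this thread. It is only an illustration under stated assumptions: every toy_* name is hypothetical, the types are not the real drm_sched structures, a plain int stands in for the kernel's kref, and the "else" branch is a guess at what cleanup of a never-armed job has to do, based on the description in the mail rather than on the scheduler source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the real drm_sched types. */
struct toy_fence {
	int refcount;		/* 0 means "never armed" in this model */
};

struct toy_job {
	struct toy_fence *s_fence;
};

static int toy_fences_freed;	/* demo instrumentation only */

static void toy_fence_free(struct toy_fence *f)
{
	free(f);
	toy_fences_freed++;
}

/* Drop one reference and free the fence on the last one. */
static void toy_fence_put(struct toy_fence *f)
{
	if (--f->refcount == 0)
		toy_fence_free(f);
}

/* Construction step 1: allocate the fence, refcount stays 0. */
static int toy_job_init(struct toy_job *job)
{
	job->s_fence = calloc(1, sizeof(*job->s_fence));
	return job->s_fence ? 0 : -1;
}

/* Construction step 2: after this the job may be seen by others. */
static void toy_job_arm(struct toy_job *job)
{
	job->s_fence->refcount = 1;
}

/* One cleanup entry point shared by the error path and the normal path. */
static void toy_job_cleanup(struct toy_job *job)
{
	if (job->s_fence->refcount) {
		/* armed: drop the reference the job itself holds */
		toy_fence_put(job->s_fence);
	} else {
		/* aborted before arming: no other thread can hold a ref */
		toy_fence_free(job->s_fence);
	}
	job->s_fence = NULL;
}

/* Exercise both paths; returns the number of fences freed, or -1. */
static int toy_demo(void)
{
	struct toy_job aborted, armed;

	toy_fences_freed = 0;
	if (toy_job_init(&aborted))
		return -1;
	if (toy_job_init(&armed)) {
		toy_job_cleanup(&aborted);
		return -1;
	}

	toy_job_cleanup(&aborted);	/* error path: refcount == 0 */

	toy_job_arm(&armed);
	toy_job_cleanup(&armed);	/* normal path: refcount == 1 */

	return toy_fences_freed;
}
```

Calling toy_demo() from a main() exercises both paths. The point of the sketch is that the refcount check only picks between "drop my reference" and "free what was never published"; it says nothing about whether other holders exist. That is why the cascade of references mentioned above has to keep the fence alive until cleanup runs on an active job.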