Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
From: Timur Kristóf
To: Alex Deucher
Cc: Christian König, André Almeida, dri-devel, amd-gfx list,
    linux-kernel@vger.kernel.org, "Pelloux-Prayer, Pierre-Eric",
    Marek Olšák, michel.daenzer@mailbox.org, Samuel Pitoiset,
    kernel-dev@igalia.com, Bas Nieuwenhuizen, "Deucher, Alexander"
Date: Tue, 02 May 2023 17:22:46 +0200
References: <20230501185747.33519-1-andrealmeid@igalia.com>
    <6ab2ff76-4518-6fac-071e-5d0d5adc4fcd@igalia.com>
    <85c538b01efb6f3fa6ff05ed1a0bc3ff87df7a61.camel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf wrote:
> >
> > Hi,
> >
> > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > > >
> > > > Christian König wrote (on Tue, 2 May 2023, 9:59):
> > > >
> > > > > On 02.05.23 at 03:26, André Almeida wrote:
> > > > >  > On 01/05/2023 16:24, Alex Deucher wrote:
> > > > >  >> On Mon, May 1, 2023 at 2:58 PM André Almeida wrote:
> > > > >  >>>
> > > > >  >>> I know that devcoredump is also used for this kind of
> > > > >  >>> information, but I believe that using an IOCTL is better
> > > > >  >>> for interfacing Mesa + Linux rather than parsing a file
> > > > >  >>> whose contents are subject to change.
> > > > >  >>
> > > > >  >> Can you elaborate a bit on that?  Isn't the whole point of
> > > > >  >> devcoredump to store this sort of information?
> > > > >  >>
> > > > >  >
> > > > >  > I think that devcoredump is something that you could submit
> > > > >  > to a bug report as it is, and then people can read/parse it
> > > > >  > as they want, not an interface to be read by Mesa... I'm not
> > > > >  > sure that it's something that I would call an API. But I
> > > > >  > might be wrong; if you know something that uses it as an
> > > > >  > API, please share.
> > > > >  >
> > > > >  > Anyway, relying on that for Mesa would mean that we would
> > > > >  > need to ensure stability for the file content and format,
> > > > >  > making it less flexible to modify in the future and prone to
> > > > >  > bugs, while the IOCTL is well defined and extensible. Maybe
> > > > >  > the dump from Mesa + devcoredump could be complementary
> > > > >  > information to a bug report.
> > > > >
> > > > >  Neither using an IOCTL nor devcoredump is a good approach for
> > > > >  this, since the values read from the hw registers are
> > > > >  completely unreliable. They might not be available because of
> > > > >  GFXOFF, or they could be overwritten, or not even updated by
> > > > >  the CP in the first place because of a hang, etc.
> > > > >
> > > > >  If you want to track progress inside an IB, what you do
> > > > >  instead is insert intermediate fence write commands into the
> > > > >  IB. E.g. something like "write value X to location Y when this
> > > > >  executes".
> > > > >
> > > > >  This way you can track not only how far the IB was processed,
> > > > >  but also which stage of processing we were in when the hang
> > > > >  occurred. E.g. End of Pipe, End of Shaders, specific shader
> > > > >  stages etc...
> > > > >
> > > > >
> > > >
> > > > Currently our biggest challenge in the userspace driver is
> > > > debugging "random" GPU hangs. We have many dozens of bug reports
> > > > from users which are like: "play the game for X hours and it
> > > > will eventually hang the GPU". With the currently available
> > > > tools, it is impossible for us to tackle these issues. André's
> > > > proposal would be a step in improving this situation.
> > > >
> > > > We already do something like what you suggest, but there are
> > > > multiple problems with that approach:
> > > >
> > > > 1. we can only submit 1 command buffer at a time because we
> > > >    won't know which IB hung
> > > > 2. we can't use chaining because we don't know where in the IB
> > > >    it hung
> > > > 3. it needs userspace to insert (a lot of) extra commands such
> > > >    as extra synchronization and memory writes
> > > > 4. it doesn't work when GPU recovery is enabled because the
> > > >    information is already gone when we detect the hang
> > > >
> > >  You can still submit multiple IBs and even chain them. All you
> > > need to do is insert into each IB commands which write the IB
> > > being executed and the position inside the IB to an extra memory
> > > location.
> > >
> > >  The write data command allows you to write as many dw as you
> > > want (up to multiple kb). The only potential problem is when you
> > > submit the same IB multiple times.
> > >
> > >  And yes, that is of course quite some extra overhead, but I think
> > > that should be manageable.
> >
> > Thanks, this sounds doable and would solve the limitation on how
> > many IBs are submitted at a time. However, it doesn't address the
> > problem that enabling this sort of debugging will still have extra
> > overhead.
> >
> > I don't mean the overhead from writing a couple of dwords for the
> > trace, but rather the overhead from needing to emit flushes or top
> > of pipe events or whatever else we need so that we can tell which
> > command hung the GPU.
> >
> > >
> > > > In my opinion, the correct solution to those problems would be
> > > > if the kernel could give userspace the necessary information
> > > > about a GPU hang before a GPU reset.
> > > >
> > >  The fundamental problem here is that the kernel doesn't have
> > > that information either. We know which IB timed out and can
> > > potentially do a devcoredump when that happens, but that's it.
> >
> >
> > Is it really not possible to know such a fundamental thing as what
> > the GPU was doing when it hung? How are we supposed to do any kind
> > of debugging without knowing that?
> >
> > I wonder what AMD's Windows driver team is doing about this
> > problem; surely they must have better tools to deal with GPU hangs?
>
> For better or worse, most teams internally rely on scan dumps via
> JTAG, which sort of limits the usefulness outside of AMD, but also
> gives you the exact state of the hardware when it's hung, so the
> hardware teams prefer it.
>

How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.
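
For reference, the marker scheme Christian describes in the quoted text
(each IB writes which IB is executing and the position inside it to an
extra memory location) can be sketched as a toy simulation. This is
only an illustration under stated assumptions: Command, IndirectBuffer,
execute and locate_hang are hypothetical names, the "trace" list stands
in for the extra memory location, and nothing here is real PM4 packet
encoding or an amdgpu/Mesa API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Command:
    name: str
    hangs: bool = False  # hypothetical flag: simulate a command that never completes


@dataclass
class IndirectBuffer:
    ib_id: int
    commands: List[Command]


def execute(ibs: List[IndirectBuffer], trace: List[int]) -> bool:
    """Run the IBs serially; return False if a command hangs.

    trace is a two-slot buffer standing in for the extra memory location:
    trace[0] = ID of the IB being executed, trace[1] = position inside it.
    """
    for ib in ibs:
        for position, cmd in enumerate(ib.commands):
            # The "write value X to location Y" step, emitted before each command.
            trace[0] = ib.ib_id
            trace[1] = position
            if cmd.hangs:
                return False  # the trace buffer keeps the last marker written
    return True


def locate_hang(ibs: List[IndirectBuffer], trace: List[int]) -> Optional[Tuple[int, str]]:
    """Map the last marker back to (ib_id, command name), or None if no marker matches."""
    for ib in ibs:
        if ib.ib_id == trace[0]:
            return ib.ib_id, ib.commands[trace[1]].name
    return None


ibs = [
    IndirectBuffer(1, [Command("clear"), Command("draw_a")]),
    IndirectBuffer(2, [Command("draw_b"), Command("dispatch", hangs=True), Command("draw_c")]),
]
trace = [-1, -1]  # no marker written yet
completed = execute(ibs, trace)
hang = locate_hang(ibs, trace)
print(completed, hang)
```

Two caveats from the thread apply. The simulation runs strictly
serially, so the last marker always points at the hung command; on real
hardware that only holds if extra flushes or synchronization are
emitted, which is the overhead mentioned above. And because the lookup
is keyed on the IB ID, submitting the same IB multiple times makes the
marker ambiguous.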