From: Alex Deucher
Date: Tue, 2 May 2023 09:45:05 -0400
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
To: Timur Kristóf
Cc: Christian König, André Almeida, dri-devel, amd-gfx list, linux-kernel@vger.kernel.org, "Pelloux-Prayer, Pierre-Eric", Marek Olšák, michel.daenzer@mailbox.org, Samuel Pitoiset, kernel-dev@igalia.com, Bas Nieuwenhuizen, "Deucher, Alexander"
References: <20230501185747.33519-1-andrealmeid@igalia.com> <6ab2ff76-4518-6fac-071e-5d0d5adc4fcd@igalia.com> <85c538b01efb6f3fa6ff05ed1a0bc3ff87df7a61.camel@gmail.com>
In-Reply-To: <85c538b01efb6f3fa6ff05ed1a0bc3ff87df7a61.camel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 2, 2023 at 9:35 AM Timur Kristóf wrote:
>
> Hi,
>
> On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> >
> > > Christian König wrote (on Tue, May 2, 2023, 9:59):
> > >
> > > > On 02.05.23 at 03:26, André Almeida wrote:
> > > > > On 01/05/2023 at 16:24, Alex Deucher wrote:
> > > > >> On Mon, May 1, 2023 at 2:58 PM André Almeida wrote:
> > > > >>>
> > > > >>> I know that devcoredump is also used for this kind of
> > > > >>> information, but I believe that using an IOCTL is better for
> > > > >>> interfacing Mesa + Linux rather than parsing a file whose
> > > > >>> contents are subject to change.
> > > > >>
> > > > >> Can you elaborate a bit on that? Isn't the whole point of
> > > > >> devcoredump to store this sort of information?
> > > > >>
> > > > >
> > > > > I think that devcoredump is something that you could submit
> > > > > with a bug report as it is, and then people can read/parse it
> > > > > as they want, not an interface to be read by Mesa... I'm not
> > > > > sure that it's something that I would call an API. But I might
> > > > > be wrong; if you know something that uses it as an API, please
> > > > > share.
> > > > >
> > > > > Anyway, relying on that for Mesa would mean that we would need
> > > > > to ensure stability for the file content and format, making it
> > > > > less flexible to modify in the future and prone to bugs, while
> > > > > the IOCTL is well defined and extensible.
> > > > > Maybe the dump from Mesa + devcoredump could be complementary
> > > > > information to a bug report.
> > > >
> > > > Neither using an IOCTL nor devcoredump is a good approach for
> > > > this, since the values read from the hw register are completely
> > > > unreliable. They might not be available because of GFXOFF, or
> > > > they could be overwritten or not even updated by the CP in the
> > > > first place because of a hang etc....
> > > >
> > > > If you want to track progress inside an IB, what you do instead
> > > > is insert intermediate fence write commands into the IB. E.g.
> > > > something like write value X to location Y when this executes.
> > > >
> > > > This way you can not only track how far the IB processed, but
> > > > also in which stages of processing we were when the hang
> > > > occurred. E.g. End of Pipe, End of Shaders, specific shader
> > > > stages etc...
> > > >
> > >
> > > Currently our biggest challenge in the userspace driver is
> > > debugging "random" GPU hangs. We have many dozens of bug reports
> > > from users which are like: "play the game for X hours and it will
> > > eventually hang the GPU". With the currently available tools, it is
> > > impossible for us to tackle these issues. André's proposal would be
> > > a step in improving this situation.
> > >
> > > We already do something like what you suggest, but there are
> > > multiple problems with that approach:
> > >
> > > 1. we can only submit 1 command buffer at a time because we won't
> > > know which IB hung
> > > 2. we can't use chaining because we don't know where in the IB it
> > > hung
> > > 3. it needs userspace to insert (a lot of) extra commands such as
> > > extra synchronization and memory writes
> > > 4. it doesn't work when GPU recovery is enabled because the
> > > information is already gone when we detect the hang
> > >
> >
> > You can still submit multiple IBs and even chain them.
> > All you need to do is insert into each IB commands which write the
> > IB being executed and the position inside that IB to an extra
> > memory location.
> >
> > The write data command allows writing as many dw as you want (up to
> > multiple kb). The only potential problem is when you submit the same
> > IB multiple times.
> >
> > And yes, that is of course quite some extra overhead, but I think
> > that should be manageable.
>
> Thanks, this sounds doable and would solve the limitation of how many
> IBs are submitted at a time. However, it doesn't address the problem
> that enabling this sort of debugging will still have extra overhead.
>
> I don't mean the overhead from writing a couple of dwords for the
> trace, but rather the overhead from needing to emit flushes or top of
> pipe events or whatever else we need so that we can tell which command
> hung the GPU.
>
> >
> > > In my opinion, the correct solution to those problems would be if
> > > the kernel could give userspace the necessary information about a
> > > GPU hang before a GPU reset.
> > >
> >
> > The fundamental problem here is that the kernel doesn't have that
> > information either. We know which IB timed out and can potentially do
> > a devcoredump when that happens, but that's it.
>
>
> Is it really not possible to know such a fundamental thing as what the
> GPU was doing when it hung? How are we supposed to do any kind of
> debugging without knowing that?
>
> I wonder what AMD's Windows driver team is doing with this problem,
> surely they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via JTAG,
which sort of limits the usefulness outside of AMD, but also gives you
the exact state of the hardware when it's hung, so the hardware teams
prefer it.

Alex
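
For readers following the thread, the marker scheme described in the quoted text (each IB writes its own progress to an extra memory location, and the trace is inspected after a hang) can be sketched as a small CPU-side simulation. This is only an illustration under stated assumptions: `ib_trace_slot`, `emit_marker`, `find_first_incomplete_ib` and `MAX_IBS` are hypothetical names, not amdgpu, Mesa or PM4 interfaces, and on real hardware the marker write would be a command inside the IB executed by the GPU rather than a C function call.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_IBS 4 /* hypothetical number of IBs in one submission */

/* Hypothetical trace layout: one slot per IB, indexed by submission order.
 * A value of 0 means the IB has not written any marker yet. */
struct ib_trace_slot {
    uint32_t last_marker; /* last position reached inside this IB */
};

/* Stand-in for the "write value X to location Y" command. In this sketch
 * it is a plain store so the decoding logic can be exercised on the CPU. */
void emit_marker(struct ib_trace_slot *trace, uint32_t ib_index,
                 uint32_t marker)
{
    assert(ib_index < MAX_IBS);
    trace[ib_index].last_marker = marker;
}

/* After a hang is detected, walk the IBs in submission order and return the
 * index of the first one that did not reach its final marker, or -1 if all
 * IBs completed. The last marker that IB did reach goes to last_marker_out. */
int find_first_incomplete_ib(const struct ib_trace_slot *trace,
                             const uint32_t *final_marker, int ib_count,
                             uint32_t *last_marker_out)
{
    for (int i = 0; i < ib_count; i++) {
        if (trace[i].last_marker < final_marker[i]) {
            *last_marker_out = trace[i].last_marker;
            return i;
        }
    }
    return -1;
}
```

With one slot per IB, several IBs can be in flight at once and still be told apart. The sketch does not handle the case mentioned in the thread where the same IB is submitted multiple times, since both submissions would write to the same slot, and it says nothing about the extra flushes or events needed on real hardware to make the markers precise.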