Message-ID: <57fa0ee4-de4f-3797-f817-d05f72541d0e@gmail.com>
Date: Wed, 3 May 2023 09:59:15 +0200
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
From: Christian König
To: Alex Deucher, Timur Kristóf
Cc: "Pelloux-Prayer, Pierre-Eric", Samuel Pitoiset, André Almeida,
 Marek Olšák, michel.daenzer@mailbox.org, linux-kernel@vger.kernel.org,
 amd-gfx list, dri-devel, kernel-dev@igalia.com, Bas Nieuwenhuizen,
 "Deucher, Alexander", Christian König
References: <20230501185747.33519-1-andrealmeid@igalia.com>
 <6ab2ff76-4518-6fac-071e-5d0d5adc4fcd@igalia.com>
 <85c538b01efb6f3fa6ff05ed1a0bc3ff87df7a61.camel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 02.05.23 at 20:41, Alex Deucher wrote:
> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote:
>> [SNIP]
>>>>>> In my opinion, the correct solution to those problems would be if
>>>>>> the kernel could give userspace the necessary information about a
>>>>>> GPU hang before a GPU reset.
>>>>>>
>>>>> The fundamental problem here is that the kernel doesn't have that
>>>>> information either. We know which IB timed out and can potentially
>>>>> do a devcoredump when that happens, but that's it.
>>>>
>>>> Is it really not possible to know such a fundamental thing as what
>>>> the GPU was doing when it hung? How are we supposed to do any kind of
>>>> debugging without knowing that?

Yes, that's indeed something I, at least, have been trying to figure out
for years as well.

Basically there are two major problems:

1. When the ASIC is hung you can't talk to the firmware engines any more,
and most state is not exposed directly but only through some fw/hw
interface.

    Just take a look at how umr reads the shader state from the SQ. When
that block is hung you can't do that any more and basically have no
chance at all of figuring out why it's hung.

    The same goes for other engines. I remember once spending a week
figuring out why the UVD block was hung during suspend. It turned out to
be a debugging nightmare, because any time you touched any register of
that block the whole system would hang.

2. There are tons of things going on in a pipelined fashion or even
completely in parallel. For example, the CP is just the beginning of a
rather long pipeline which at the end produces a bunch of pixels.

    In almost all cases I've seen, the problem was somewhere deep in the
pipeline and only very rarely at the beginning.

>>>>
>>>> I wonder what AMD's Windows driver team is doing with this problem,
>>>> surely they must have better tools to deal with GPU hangs?
>>> For better or worse, most teams internally rely on scan dumps via JTAG
>>> which sort of limits the usefulness outside of AMD, but also gives you
>>> the exact state of the hardware when it's hung so the hardware teams
>>> prefer it.
>>>
>> How does this approach scale?
>> It's not something we can ask users to do, and even if all of us in
>> the radv team had a JTAG device, we wouldn't be able to play every
>> game that users experience random hangs with.
> It doesn't scale or lend itself particularly well to external
> development, but that's the current state of affairs.

The usual approach seems to be to reproduce a problem in a lab with a
JTAG attached, give the hw guys a scan dump, and they can then tell you
why something didn't work as expected.

And yes, that absolutely doesn't scale.

Christian.

>
> Alex