Message-ID: <718e7aa5-0e08-fc88-b612-ae82ab9736cd@gmail.com>
Date: Thu, 4 May 2023 08:43:42 +0200
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
To: André Almeida, Timur Kristóf, Felix Kuehling
Cc: Alex Deucher, "Pelloux-Prayer, Pierre-Eric", Marek Olšák,
 michel.daenzer@mailbox.org, dri-devel, linux-kernel@vger.kernel.org,
 Samuel Pitoiset, amd-gfx list, kernel-dev@igalia.com,
 "Deucher, Alexander", Christian König
From: Christian König
In-Reply-To: <59774c28-a0ef-d4f2-e920-503857bce1cf@igalia.com>

On 03.05.23 at 21:14, André Almeida wrote:
> On 03/05/2023 at 14:43, Timur Kristóf wrote:
>> Hi Felix,
>>
>> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>>> That's the worst-case scenario where you're debugging HW or FW
>>> issues. Those should be pretty rare post-bringup. But are there
>>> hangs caused by user mode driver or application bugs that are
>>> easier to debug and probably don't even require a GPU reset?
>>
>> There are many GPU hangs that gamers experience while playing. We have
>> dozens of open bug reports against RADV about GPU hangs on various GPU
>> generations. These usually fall into two categories:
>>
>> 1. When the hang always happens at the same point in a game. These are
>> painful to debug but manageable.
>> 2. "Random" hangs that happen to users over the course of playing a
>> game for several hours. It is absolute hell to try to even reproduce
>> let alone diagnose these issues, and this is what we would like to
>> improve.
>>
>> For these hard-to-diagnose problems, it is already a challenge to
>> determine whether the problem is the kernel (e.g. setting wrong
>> voltages / frequencies) or userspace (e.g. missing some
>> synchronization). It can even be a game bug that we need to work
>> around.
>>
>>> For example most VM faults can
>>> be handled without hanging the GPU. Similarly, a shader in an endless
>>> loop should not require a full GPU reset.
>>
>> This is actually not the case. AFAIK André's test case was an app that
>> had an infinite loop in a shader.
>>
>
> This is the test app if anyone wants to try it out:
> https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.
>
> The kernel calls amdgpu_ring_soft_recovery() when I run my example,
> but I'm not sure what a soft recovery means here and if it's a full
> GPU reset or not.

That's just "soft" recovery. In other words, we send the SQ a command to
kill a shader.

That usually works for shaders which contain an endless loop (which is
the most common application bug), but unfortunately not for any other
problem.
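
A toy model of the distinction drawn above, written in Python purely for illustration. It assumes that a timeout first triggers soft recovery (killing the running shader) and that anything soft recovery cannot fix escalates to a full GPU reset; that escalation step and all names here (`Hang`, `soft_recovery`, `handle_timeout`) are assumptions of this sketch, not the amdgpu code path.

```python
from enum import Enum


class Hang(Enum):
    """Hypothetical hang causes, just enough for this sketch."""
    ENDLESS_SHADER_LOOP = "endless shader loop"
    OTHER = "other"


def soft_recovery(hang: Hang) -> bool:
    """Model of 'tell the SQ to kill the shader'.

    Simplification: succeeds only when the shader itself is stuck in a
    loop. The thread says this 'usually' works, not always.
    """
    return hang is Hang.ENDLESS_SHADER_LOOP


def handle_timeout(hang: Hang) -> str:
    """Try the cheap recovery first, then fall back (assumed behaviour)."""
    if soft_recovery(hang):
        return "soft recovery"
    return "full GPU reset"


outcomes = {h.value: handle_timeout(h) for h in Hang}
```

With this model, an endless shader loop maps to "soft recovery" and every other cause maps to "full GPU reset".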
>
> But if we can at least trust the CP registers to dump information for
> soft resets, it would be some improvement from the current state, I
> think.

Especially for endless loops the CP registers are completely useless.
The CP just prepares the draw commands and all the state, which is then
sent to the SQ for execution.

As Marek wrote, we know which submission has timed out in the kernel,
but we can't figure out where inside this submission we are.

>
>>>
>>> It's more complicated for graphics because of the more complex
>>> pipeline and the lack of CWSR. But it should still be possible to do
>>> some debugging without JTAG if the problem is in SW and not HW or
>>> FW. It's probably worth improving that debuggability without getting
>>> hung up on the worst case.
>>
>> I agree, and we welcome any constructive suggestion to improve the
>> situation. It seems like our idea doesn't work if the kernel can't give
>> us the information we need.
>>
>> How do we move forward?

As I said, the best approach to figure out which draw command hangs is
to sprinkle WRITE_DATA commands into your command stream.

That's not so much overhead, and at least Bas thinks that this is doable
in RADV with some changes.

For the kernel we can certainly implement devcoredump and allow writing
out register values and other state when a problem happens.

Regards,
Christian.

>>
>> Best regards,
>> Timur
>>
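
The WRITE_DATA suggestion above can be illustrated with a small host-side simulation in Python. This is only a sketch of the idea: the tuple encoding of commands and the helpers `instrument`, `run_until_hang` and `last_draw_started` are hypothetical names invented here, not PM4 packets or RADV interfaces. It also assumes strictly in-order execution, whereas real hardware overlaps work.

```python
from typing import List, Optional, Tuple

# Hypothetical command encoding for this sketch only:
#   ("WRITE_DATA", marker) -> store `marker` in a trace buffer
#   ("DRAW", name, hangs)  -> a draw; `hangs` simulates one that never finishes
Draw = Tuple[str, bool]


def instrument(draws: List[Draw]) -> List[tuple]:
    """Put a marker write in front of every draw.

    Marker i + 1 means "draw number i is about to start".
    """
    stream: List[tuple] = []
    for i, (name, hangs) in enumerate(draws):
        stream.append(("WRITE_DATA", i + 1))
        stream.append(("DRAW", name, hangs))
    return stream


def run_until_hang(stream: List[tuple]) -> int:
    """Execute commands in order and return the last marker written."""
    trace = 0  # 0 means no marker has been written yet
    for cmd in stream:
        if cmd[0] == "WRITE_DATA":
            trace = cmd[1]
        elif cmd[0] == "DRAW" and cmd[2]:
            break  # simulated hang: nothing after this point executes
    return trace


def last_draw_started(draws: List[Draw], trace: int) -> Optional[str]:
    """Map the marker read back after the timeout to a draw name."""
    return draws[trace - 1][0] if trace > 0 else None


draws = [("sky", False), ("terrain", False), ("water", True), ("ui", False)]
trace = run_until_hang(instrument(draws))
suspect = last_draw_started(draws, trace)
```

Here the marker read back after the simulated hang is 3, which points at the "water" draw. The kernel only knows that the submission as a whole timed out; the marker narrows it down to a position inside the submission.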