Message-ID: <59774c28-a0ef-d4f2-e920-503857bce1cf@igalia.com>
Date: Wed, 3 May 2023 16:14:11 -0300
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
From: André Almeida
To: Timur Kristóf, Felix Kuehling
Cc: Alex Deucher, "Pelloux-Prayer, Pierre-Eric", Marek Olšák,
 michel.daenzer@mailbox.org, dri-devel, Christian König,
 linux-kernel@vger.kernel.org, Samuel Pitoiset, amd-gfx list,
 kernel-dev@igalia.com, "Deucher, Alexander", Christian König
In-Reply-To: <967a044bc2723cc24ab914506c0164db08923c59.camel@gmail.com>
References: <20230501185747.33519-1-andrealmeid@igalia.com>
 <6ab2ff76-4518-6fac-071e-5d0d5adc4fcd@igalia.com>
 <85c538b01efb6f3fa6ff05ed1a0bc3ff87df7a61.camel@gmail.com>
 <57fa0ee4-de4f-3797-f817-d05f72541d0e@gmail.com>
 <2bf162d0-6112-8370-8828-0e0b21ac22ba@amd.com>
 <967a044bc2723cc24ab914506c0164db08923c59.camel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/05/2023 14:43, Timur Kristóf wrote:
> Hi Felix,
>
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW issues.
>> Those should be pretty rare post-bringup.
>> But are there hangs caused by user mode driver or application bugs
>> that are easier to debug and probably don't even require a GPU reset?
>
> There are many GPU hangs that gamers experience while playing. We have
> dozens of open bug reports against RADV about GPU hangs on various GPU
> generations. These usually fall into two categories:
>
> 1. When the hang always happens at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to try to even reproduce
> let alone diagnose these issues, and this is what we would like to
> improve.
>
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is the kernel (eg. setting wrong voltages
> / frequencies) or userspace (eg. missing some synchronization), can be
> even a game bug that we need to work around.
>
>> For example most VM faults can
>> be handled without hanging the GPU. Similarly, a shader in an endless
>> loop should not require a full GPU reset.
>
> This is actually not the case, AFAIK André's test case was an app that
> had an infinite loop in a shader.
>

This is the test app, if anyone wants to try it out:

https://github.com/andrealmeid/vulkan-triangle-v1

Just compile and run.

The kernel calls amdgpu_ring_soft_recovery() when I run my example, but
I'm not sure what a soft recovery means here, or whether it's a full GPU
reset or not.

But if we can at least trust the CP registers to dump information for
soft resets, that would be an improvement over the current state, I think.

>>
>> It's more complicated for graphics because of the more complex pipeline
>> and the lack of CWSR. But it should still be possible to do some
>> debugging without JTAG if the problem is in SW and not HW or FW. It's
>> probably worth improving that debugability without getting hung-up on
>> the worst case.
>
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't give
> us the information we need.
>
> How do we move forward?
>
> Best regards,
> Timur
>