From: Alexandre Desnoyers
Date: Thu, 11 Mar 2021 16:02:23 +0100
Subject: Re: [PATCH]] drm/amdgpu/gfx9: add gfxoff quirk
To: Daniel Gomez
Cc: Alex Deucher, Christian König, David Airlie, Daniel Vetter,
    Sumit Semwal, Hawking Zhang, Huang Rui, Nirmoy Das, Dennis Li,
    Monk Liu, Yintian Tao, Guchun Chen, Evan Quan, amd-gfx list,
    Maling list - DRI developers, LKML, linux-media,
    "moderated list:DMA BUFFER SHARING FRAMEWORK"
References: <20210310163655.2591893-1-daniel@qtec.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 11, 2021 at 2:49 PM Daniel Gomez wrote:
>
> On Thu, 11 Mar 2021 at 10:09, Daniel Gomez wrote:
> >
> > On Wed, 10 Mar 2021 at 18:06, Alex Deucher wrote:
> > >
> > > On Wed, Mar 10, 2021 at 11:37 AM Daniel Gomez wrote:
> > > >
> > > > Disabling GFXOFF via the quirk list fixes a hardware lockup in
> > > > Ryzen V1605B, RAVEN 0x1002:0x15DD rev 0x83.
> > > >
> > > > Signed-off-by: Daniel Gomez
> > > > ---
> > > >
> > > > This patch is a continuation of the work here:
> > > > https://lkml.org/lkml/2021/2/3/122 where a hardware lockup was discussed and
> > > > a dma_fence deadlock was provoke as a side effect.
> > > > To reproduce the issue
> > > > please refer to the above link.
> > > >
> > > > The hardware lockup was introduced in 5.6-rc1 for our particular revision as it
> > > > wasn't part of the new blacklist. Before that, in kernel v5.5, this hardware was
> > > > working fine without any hardware lock because the GFXOFF was actually disabled
> > > > by the if condition for the CHIP_RAVEN case. So this patch, adds the 'Radeon
> > > > Vega Mobile Series [1002:15dd] (rev 83)' to the blacklist to disable the GFXOFF.
> > > >
> > > > But besides the fix, I'd like to ask from where this revision comes from. Is it
> > > > an ASIC revision or is it hardcoded in the VBIOS from our vendor? From what I
> > > > can see, it comes from the ASIC and I wonder if somehow we can get an APU in the
> > > > future, 'not blacklisted', with the same problem. Then, should this table only
> > > > filter for the vendor and device and not the revision? Do you know if there are
> > > > any revisions for the 1002:15dd validated, tested and functional?
> > >
> > > The pci revision id (RID) is used to specify the specific SKU within a
> > > family. GFXOFF is supposed to be working on all raven variants. It
> > > was tested and functional on all reference platforms and any OEM
> > > platforms that launched with Linux support. There are a lot of
> > > dependencies on sbios in the early raven variants (0x15dd), so it's
> > > likely more of a specific platform issue, but there is not a good way
> > > to detect this so we use the DID/SSID/RID as a proxy. The newer raven
> > > variants (0x15d8) have much better GFXOFF support since they all
> > > shipped with newer firmware and sbios.
> >
> > We took one of the first reference platform boards to design our
> > custom board based on the V1605B and I assume it has one of the early 'unstable'
> > raven variants with RID 0x83. Also, as OEM we are in control of the bios
> > (provided by insyde) but I wasn't sure about the RID so, thanks for the
> > clarification.
> > Is there anything we can do with the bios to have the GFXOFF
> > enabled and 'stable' for this particular revision? Otherwise we'd need to add
> > the 0x83 RID to the table. Also, there is an extra ']' in the patch
> > subject. Sorry
> > for that. Would you need a new patch in case you accept it with the ']' removed?
> >
> > Good to hear that the newer raven versions have better GFXOFF support.
>
> Adding Alex Desnoyer to the loop as he is the electronic/hardware and
> bios responsible so, he can
> provide more information about this.

Hello everyone,

We, Qtechnology, are the OEM of the hardware platform where we
originally discovered the bug. Our platform is based on the AMD
Dibbler V-1000 reference design, with the latest Insyde BIOS release
available for the (now unsupported) Dibbler platform. We have the
Insyde BIOS source code internally, so we can make some modifications
as needed.

The last test that Daniel and I performed was on a standard Dibbler
PCB rev.B1 motherboard (NOT our platform), using the corresponding
latest AMD-released BIOS "RDB1109GA". As Daniel wrote, the hardware
lockup can be reproduced on the Dibbler, even though it has a different
RID than our V1605B APU.

We also have a Neousys Technology POC-515 embedded computer (V-1000,
V1605B) in our office. The Neousys PC also uses an Insyde BIOS. This
computer also locks up in the test.
https://www.neousys-tech.com/en/product/application/rugged-embedded/poc-500-amd-ryzen-ultra-compact-embedded-computer

Digging into the BIOS source code, the only reference to GFXOFF is in
the SMU and PSP firmware release notes, where some bug fixes have been
mentioned for previous SMU/PSP releases. After a quick
"git grep -i gfx | grep -i off", there seems to be no mention of GFXOFF
in the Insyde UEFI (including AMD PI) code base.

I would appreciate any information regarding the BIOS modifications
needed to make the GFXOFF feature stable.
As you (Alex Deucher) mentioned, it should be functional on all AMD Raven
reference platforms.

Regards,
Alexandre Desnoyers

>
> I've now done a test on the reference platform (dibbler) with the
> latest bios available
> and the hw lockup can be also reproduced with the same steps.
>
> For reference, I'm using mainline kernel 5.12-rc2.
>
> [ 5.938544] [drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1002:0x15DD 0xC1).
> [ 5.939942] amdgpu: ATOM BIOS: 113-RAVEN-11
>
> As in the previous cases, the clocks go to 100% of usage when the hang occurs.
>
> However, when the gpu hangs, dmesg output displays the following:
>
> [ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=188, emitted seq=191
> [ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 311 thread Xorg:cs0 pid 312
> [ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=188, emitted seq=191
> [ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 311 thread Xorg:cs0 pid 312
> [ 1568.507000] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
> [ 1628.491882] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1628.491882] rcu: 3-...!: (665 ticks this GP) idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
> [ 1628.491882] rcu: rcu_sched kthread timer wakeup didn't happen for 58497 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [ 1628.491882] rcu: Possible timer handling issue on cpu=2 timer-softirq=55225
> [ 1628.491882] rcu: rcu_sched kthread starved for 58500 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
> [ 1628.491882] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 1628.491882] rcu: RCU grace-period kthread stack dump:
> [ 1628.491882] rcu: Stack dump where RCU GP kthread last ran:
> [ 1808.518445] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1808.518445] rcu: 3-...!: (2643 ticks this GP) idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
> [ 1808.518445] rcu: rcu_sched kthread starved for 238526 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
> [ 1808.518445] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 1808.518445] rcu: RCU grace-period kthread stack dump:
> [ 1808.518445] rcu: Stack dump where RCU GP kthread last ran:
>
> >
> > Daniel
> >
> > >
> > > Alex
> > >
> > > >
> > > > Logs:
> > > > [ 27.708348] [drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1002:0x15DD 0x83).
> > > > [ 27.789156] amdgpu: ATOM BIOS: 113-RAVEN-115
> > > >
> > > > Thanks in advance,
> > > > Daniel
> > > >
> > > > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > index 65db88bb6cbc..319d4b99aec8 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > @@ -1243,6 +1243,8 @@ static const struct amdgpu_gfxoff_quirk amdgpu_gfxoff_quirk_list[] = {
> > > >         { 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
> > > >         /* GFXOFF is unstable on C6 parts with a VBIOS 113-RAVEN-114 */
> > > >         { 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
> > > > +       /* GFXOFF provokes a hw lockup on 83 parts with a VBIOS 113-RAVEN-115 */
> > > > +       { 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
> > > >         { 0, 0, 0, 0, 0 },
> > > >  };
> > > >
> > > > --
> > > > 2.30.1
> > > >
> > > > _______________________________________________
> > > > dri-devel mailing list
> > > > dri-devel@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
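
For readers following the quoted diff, the quirk-table lookup discussed in this thread (matching on DID/SSID/RID) can be sketched roughly as below. This is a standalone userspace illustration, not the driver code: the struct name, field names, and should_disable_gfxoff() are assumptions made for the example. Only the three table entries and the all-zero terminator are taken from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's quirk entry; field names are assumed. */
struct gfxoff_quirk {
	uint16_t chip_vendor;
	uint16_t chip_device;
	uint16_t subsys_vendor;
	uint16_t subsys_device;
	uint8_t revision;
};

/* Entries as they appear in the quoted diff, ending with the zero terminator. */
static const struct gfxoff_quirk gfxoff_quirk_list[] = {
	{ 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
	{ 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
	{ 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
	{ 0, 0, 0, 0, 0 },
};

/*
 * Walk the table until the terminator and report whether all five IDs
 * match one entry exactly. A match means GFXOFF should be disabled.
 */
static bool should_disable_gfxoff(uint16_t vendor, uint16_t device,
				  uint16_t subsys_vendor, uint16_t subsys_device,
				  uint8_t revision)
{
	const struct gfxoff_quirk *p;

	for (p = gfxoff_quirk_list; p->chip_device != 0; p++) {
		if (p->chip_vendor == vendor &&
		    p->chip_device == device &&
		    p->subsys_vendor == subsys_vendor &&
		    p->subsys_device == subsys_device &&
		    p->revision == revision)
			return true;
	}
	return false;
}
```

With an exact-match table like this, the 0x83 part from the patch is caught, but the Dibbler board reporting revision 0xC1 in the log above is not, even though the lockup was reproduced there. That is the gap behind the question of whether the table should filter on the revision at all.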