Received: by 2002:a05:6358:4e97:b0:b3:742d:4702 with SMTP id ce23csp1860126rwb; Fri, 19 Aug 2022 10:34:52 -0700 (PDT) X-Google-Smtp-Source: AA6agR7o8OZXP/VDvfGgbIVd5ky+kxb7DI/+n8VcHVOrgVtfDR7koOwr9GPIncmDcYthrBaxnipW X-Received: by 2002:a17:902:e892:b0:170:c2f:cb40 with SMTP id w18-20020a170902e89200b001700c2fcb40mr8421597plg.2.1660930492547; Fri, 19 Aug 2022 10:34:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1660930492; cv=none; d=google.com; s=arc-20160816; b=yDdXsE4AXb7erIQEUJqqBIAxmSty3vRNIuQaclflI3gzesV+gbmZ4YalMAd0z8ofzD /Cfg+X16+DilQjAaeu/E12ezqZe0l9+DD71IbVv7zFDZsYtwBTiGM40fdLDvvItFw63f gdjlAYgu1CSwgQ7DhEryBFStE359bYPt0rJilR12QNZOfYicLd4PSxCiubW71L84jFM2 6UsTICxFPMO6nM9iiJDeBw8Ed+8FPkCx171lOCtO0ePCmpuUKPcY+MwhCLzR7F4JPPdq OmdY7eEhPtA9Clj+KuC8UQwxqXYFkUYH3yvxwdFBfE7qmkAqoKZdvAGzjlZ/EN+LkU+p Y5Vw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=BcCj1m7KIf5adDuWbKKovxIK0b1svjA3E9fjFS29A+o=; b=vb3jGBPrVvmMpadzb17F+7SwR5FPc++H5BGw3Cn9fRXU4vDtGJKzjGmUd+7Xra3GTx HByZKynGepslKJsHmtl20VKLqcASCe2v4UPb8YqrveRXh7I/DISmZOwpt9mxQV4y6nQJ KfzMH8zr9nw3OJjgijz5Mk2FMLUNY8vor6wGIsQeZo0/qh0FVhcR8EI5i1Q2Jqc4RUKv 2LCg4RE7LAUX5xc+V3KZSNiwTLG3AOI0OFDFGl6JkkQEQMi14hto+31l+uwfpegVXFRl hOxGB76gFNSm8eXVIXXLhfNT30eCEGSPci8jWN66QHdztfIgLD2fy7B0/qtjITqmyntU KMfQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=dVKOGeil; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id o15-20020aa7978f000000b0052abccd715csi1904499pfp.118.2022.08.19.10.34.41; Fri, 19 Aug 2022 10:34:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=dVKOGeil; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351630AbiHSQJB (ORCPT + 99 others); Fri, 19 Aug 2022 12:09:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47758 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351689AbiHSQGW (ORCPT ); Fri, 19 Aug 2022 12:06:22 -0400 Received: from mail-ot1-x32d.google.com (mail-ot1-x32d.google.com [IPv6:2607:f8b0:4864:20::32d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1352C107F12 for ; Fri, 19 Aug 2022 08:55:48 -0700 (PDT) Received: by mail-ot1-x32d.google.com with SMTP id q39-20020a056830442700b0063889adc0ddso3351586otv.1 for ; Fri, 19 Aug 2022 08:55:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc; bh=BcCj1m7KIf5adDuWbKKovxIK0b1svjA3E9fjFS29A+o=; b=dVKOGeildqf1O5aKP/wT610XNrQGNA+oIpVRrosLqk6a6FBex7J5AQ2uzynwm5iiAe OjVTeQr+sgoYtWavTWBFLXegOsICtcPa27vGiKjvFfXw4jrbRU38eGFCCUf80crfKUbx BrXgN0f+1KfiS0iamuAuvNfsGqhP/3Z1MU22tVFVswIGiZWDPu63QhBkgkHTkHg1+D/D Lof+7ofYRBIKj+EXAwn0Jd+2TZSjyGjURv7CdPXyYZHhv4efVY+B+SxqZ3yKCo+7/C+B wbcyk08SQLKVWg+vB6VTZHrtBHDEVCniY5GCmRFxqzlWXFAUlcKNu1Rt3F7Y+x454p97 gBXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc; bh=BcCj1m7KIf5adDuWbKKovxIK0b1svjA3E9fjFS29A+o=; b=07qUqKHhkulntLmM65MhEwNgJVXYVWwBeBgobeiI5xGzgOc4FxoDbyFSG5cxA66Fh0 XACKzHf42cQvGjigfxnr2mIEuSt3g3elrk/nQVap0pc3rjsnCrpsSBlfCgDX1K/L5xMg OYmkTtslxil6IlzHlxuwiuriDsVso8YKckFOycqgaqFFOHbuPM1ijz8OT1CWEX0UkJE6 7WV2Q/EYeAVgw/04REM04A1ppO0bx6fvjuOVCes5yZOq2kZGWsyiUo57iqlmi/+1+4zO eEk0fwAvo9b/RSJ1Gq+Y/lqfmvSmjN5Uls0vxKSWKfYL4wGMK+aTIkTOBKsk1Ixebw9H 6ugg== X-Gm-Message-State: ACgBeo0W7OmKfLpLRiODdVZfHSSO/3cQbol05cBbmselmFMyAe/2VLp8 UNfm/RW8qLNfSfr7M3PbQjzccvIzIBQf1fm232Q= X-Received: by 2002:a9d:c64:0:b0:636:df4c:e766 with SMTP id 91-20020a9d0c64000000b00636df4ce766mr3253036otr.12.1660924546778; Fri, 19 Aug 2022 08:55:46 -0700 (PDT) MIME-Version: 1.0 References: <20220811072540.964309-1-lizhenneng@kylinos.cn> <2f38b94b-0965-80f2-5bae-840765ffc4da@kylinos.cn> <103b4a67-385c-68f5-f40f-39bbc1d9f723@kylinos.cn> In-Reply-To: From: Alex Deucher Date: Fri, 19 Aug 2022 11:55:35 -0400 Message-ID: Subject: Re: [PATCH] drm/radeon: add a force flush to delay work when radeon To: =?UTF-8?Q?Christian_K=C3=B6nig?= Cc: =?UTF-8?B?5p2O55yf6IO9?= , =?UTF-8?Q?Christian_K=C3=B6nig?= , Alex Deucher , David Airlie , Pan Xinhui , linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Applied. Thanks! Alex On Fri, Aug 19, 2022 at 6:07 AM Christian K=C3=B6nig wrote: > > Am 19.08.22 um 11:34 schrieb =E6=9D=8E=E7=9C=9F=E8=83=BD: > > =E5=9C=A8 2022/8/17 19:40, Christian K=C3=B6nig =E5=86=99=E9=81=93: > > Am 17.08.22 um 09:31 schrieb =E6=9D=8E=E7=9C=9F=E8=83=BD: > > > =E5=9C=A8 2022/8/15 21:12, Christian K=C3=B6nig =E5=86=99=E9=81=93: > > Am 15.08.22 um 09:34 schrieb =E6=9D=8E=E7=9C=9F=E8=83=BD: > > > =E5=9C=A8 2022/8/12 18:55, Christian K=C3=B6nig =E5=86=99=E9=81=93: > > Am 11.08.22 um 09:25 schrieb Zhenneng Li: > > Although radeon card fence and wait for gpu to finish processing current = batch rings, > there is still a corner case that radeon lockup work queue may not be ful= ly flushed, > and meanwhile the radeon_suspend_kms() function has called pci_set_power_= state() to > put device in D3hot state. > > > If I'm not completely mistaken the reset worker uses the suspend/resume f= unctionality as well to get the hardware into a working state again. > > So if I'm not completely mistaken this here would lead to a deadlock, ple= ase double check that. > > > We have tested many times, there are no deadlock. > > > Testing doesn't tells you anything, you need to audit the call paths. > > In which situation, there would lead to a deadlock? > > > GPU resets. > > > Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) will wait fo= r a lockup_work to finish executing the last queueing, but this kernel fun= c haven't get any lock, and lockup_work will run in another kernel thread, = so I think flush_delayed_work could not lead to a deadlock. > > Therefor if radeon_gpu_reset is called in another thread when radeon_susp= end_kms is blocked on flush_delayed_work, there could not lead to a deadloc= k. > > > Ok sounds like you didn't go what I wanted to say. > > The key problem is that radeon_gpu_reset() calls radeon_suspend() which i= n turn calls rdev->asic->suspend(). > > And this function in turn could end up in radeon_suspend_kms() again, but= I'm not 100% sure about that. > > Just double check the order of function called here (e.g. if radeon_suspe= nd_kms() call radeon_suspend() or the other way around). Apart from that y= our patch looks correct to me as well. > > radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not ca= ll radeon_suspend_kms, the more detail of call flow, we can see the attache= ment file: radeon-suspend-reset.png. > > Sorry I may have mistaken your meaning. > > > Ok in this case my memory of the call flow wasn't correct any more. > > Feel free to add an Acked-by: Christian K=C3=B6nig to the patch. > > Alex should then pick it up for upstreaming. > > Thanks, > Christian. > > > Regards, > Christian. > > > > Regards, > Christian. > > > > Regards, > Christian. > > Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State. > > Configuration and Message requests are the only TLPs accepted by a Functi= on in > the D3hot state. All other received Requests must be handled as Unsupport= ed Requests, > and all received Completions may optionally be handled as Unexpected Comp= letions. > > This issue will happen in following logs: > Unable to handle kernel paging request at virtual address 00008800e000801= 0 > CPU 0 kworker/0:3(131): Oops 0 > pc =3D [] ra =3D [] ps =3D 0000 Tai= nted: G W > pc is at si_gpu_check_soft_reset+0x3c/0x240 > ra is at si_dma_is_lockup+0x34/0xd0 > v0 =3D 0000000000000000 t0 =3D fff08800e0008010 t1 =3D 0000000000010000 > t2 =3D 0000000000008010 t3 =3D fff00007e3c00000 t4 =3D fff00007e3c00258 > t5 =3D 000000000000ffff t6 =3D 0000000000000001 t7 =3D fff00007ef078000 > s0 =3D fff00007e3c016e8 s1 =3D fff00007e3c00000 s2 =3D fff00007e3c00018 > s3 =3D fff00007e3c00000 s4 =3D fff00007fff59d80 s5 =3D 0000000000000000 > s6 =3D fff00007ef07bd98 > a0 =3D fff00007e3c00000 a1 =3D fff00007e3c016e8 a2 =3D 0000000000000008 > a3 =3D 0000000000000001 a4 =3D 8f5c28f5c28f5c29 a5 =3D ffffffff810f4338 > t8 =3D 0000000000000275 t9 =3D ffffffff809b66f8 t10 =3D ff6769c5d964b80= 0 > t11=3D 000000000000b886 pv =3D ffffffff811bea20 at =3D 0000000000000000 > gp =3D ffffffff81d89690 sp =3D 00000000aa814126 > Disabling lock debugging due to kernel taint > Trace: > [] si_dma_is_lockup+0x34/0xd0 > [] radeon_fence_check_lockup+0xd0/0x290 > [] process_one_work+0x280/0x550 > [] worker_thread+0x70/0x7c0 > [] worker_thread+0x130/0x7c0 > [] kthread+0x200/0x210 > [] worker_thread+0x0/0x7c0 > [] kthread+0x14c/0x210 > [] ret_from_kernel_thread+0x18/0x20 > [] kthread+0x0/0x210 > Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101 > <88210000> 4821ed21 > So force lockup work queue flush to fix this problem. > > Signed-off-by: Zhenneng Li > --- > drivers/gpu/drm/radeon/radeon_device.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/rad= eon/radeon_device.c > index 15692cb241fc..e608ca26780a 100644 > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool= suspend, > if (r) { > /* delay GPU reset to resume */ > radeon_fence_driver_force_completion(rdev, i); > + } else { > + /* finish executing delayed work */ > + flush_delayed_work(&rdev->fence_drv[i].lockup_work); > } > } > > > > >