From: Alex Deucher
Date: Thu, 30 Nov 2023 18:28:16 -0500
Subject: Re: Radeon regression in 6.6 kernel
To: Luben Tuikov
Cc: Phillip Susi, Linux regressions mailing list, Christian König, linux-kernel@vger.kernel.org, "amd-gfx@lists.freedesktop.org", dri-devel@lists.freedesktop.org, Alex Deucher, Christian König, Danilo Krummrich

On Wed, Nov 29, 2023 at 10:47 PM Luben Tuikov wrote:
>
> On 2023-11-29 22:36, Luben Tuikov wrote:
> > On 2023-11-29 15:49, Alex Deucher wrote:
> >> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote:
> >>>
> >>> Actually I think I see the problem. I'll try and send out a patch
> >>> later today to test.
> >>
> >> Does the attached patch fix it?
> >
> > Thanks for the patch, Alex.
> >
> > Is it possible for AMD to also reproduce this issue and test this patch on a Navi23 system?
> >
> >> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> >> From: Alex Deucher
> >> Date: Wed, 29 Nov 2023 15:44:25 -0500
> >> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> >>
> >> We need to make disable this after the last eviction
> >
> > "make disable" --> "disable"
> >
> >> call, but before we disable the SDMA IP.
> >>
> >> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> >> Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
> >
> > Link: https://lore.kernel.org/r/87edgv4x3i.fsf@vps.thesusis.net
> >
> > Let's link the start of the thread.
> >
> > Regards,
> > Luben
> >
> >> Signed-off-by: Alex Deucher
> >> Cc: Phillip Susi
> >> Cc: Luben Tuikov
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index b5edf40b5d03..78553e027db4 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> >>
> >>       amdgpu_ras_suspend(adev);
> >>
> >> -     amdgpu_ttm_set_buffer_funcs_status(adev, false);
> >> -
> >>       amdgpu_device_ip_suspend_phase1(adev);
> >>
> >>       if (!adev->in_s0ix)
> >> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> >>       if (r)
> >>               return r;
> >>
> >> +     amdgpu_ttm_set_buffer_funcs_status(adev, false);
> >> +
>
> If you're moving this past phase 1, there's another instance in amdgpu_device_ip_suspend(),
> which may need to be moved down.

I think that one should be ok since we don't do any evictions in
amdgpu_device_ip_suspend().

Alex

>
> Regards,
> Luben
>
> >>       amdgpu_fence_driver_hw_fini(adev);
> >>
> >>       amdgpu_device_ip_suspend_phase2(adev);
> >
> >>
> >> Alex
> >>
> >>>
> >>> Alex
> >>>
> >>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote:
> >>>>
> >>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote:
> >>>>>
> >>>>> On 2023-11-29 10:22, Alex Deucher wrote:
> >>>>>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher wrote:
> >>>>>>>
> >>>>>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote:
> >>>>>>>>
> >>>>>>>> On 2023-11-28 17:13, Alex Deucher wrote:
> >>>>>>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote:
> >>>>>>>>>>
> >>>>>>>>>> Alex Deucher writes:
> >>>>>>>>>>
> >>>>>>>>>>>> In that case those are the already known problems with the scheduler
> >>>>>>>>>>>> changes, aren't they?
> >>>>>>>>>>>
> >>>>>>>>>>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm
> >>>>>>>>>>> misunderstanding what the original report was actually testing. If it
> >>>>>>>>>>> was 6.7, then try reverting:
> >>>>>>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
> >>>>>>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
> >>>>>>>>>>
> >>>>>>>>>> At some point it was suggested that I file a gitlab issue, but I took
> >>>>>>>>>> this to mean it was already known and being worked on. -rc3 came out
> >>>>>>>>>> today and still has the problem. Is there a known issue I could track?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> At this point, unless there are any objections, I think we should just
> >>>>>>>>> revert the two patches
> >>>>>>>> Uhm, no.
> >>>>>>>>
> >>>>>>>> Why "the two" patches?
> >>>>>>>>
> >>>>>>>> This email, part of this thread,
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/all/87r0kircdo.fsf@vps.thesusis.net/
> >>>>>>>>
> >>>>>>>> clearly states that reverting *only* this commit,
> >>>>>>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues
> >>>>>>>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't really change
> >>>>>>>> anything operational, other than using an allocated array, instead of a static one, in DRM,
> >>>>>>>> while the 2nd patch is solely contained within the amdgpu driver code.)
> >>>>>>>>
> >>>>>>>> Leaving us with only this change,
> >>>>>>>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> >>>>>>>> to be at fault, as the kernel log attached in the linked email above shows.
> >>>>>>>>
> >>>>>>>> The conclusion is that only b70438004a14f4 needs reverting.
> >>>>>>>
> >>>>>>> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4,
> >>>>>>> 56e449603f0ac5 breaks amdgpu.
> >>>>>>
> >>>>>> We can try and re-enable it in the next kernel. I'm just not sure
> >>>>>> we'll be able to fix this in time for 6.7 with the holidays and all
> >>>>>> and I don't want to cause a lot of scheduler churn at the end of the
> >>>>>> 6.7 cycle if we hold off and try and fix it. Reverting seems like the
> >>>>>> best short term solution.
> >>>>>
> >>>>> A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
> >>>>> the opportunity for a 1-to-1 relationship between an entity and a scheduler.
> >>>>> (Should've always been the case, from the outset. Not sure why it was coded as
> >>>>> a fixed-size array.)
> >>>>>
> >>>>> Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem
> >>>>> is wholly contained in amdgpu, and no other driver has this problem, there is
> >>>>> no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
> >>>>> an init bug in amdgpu. See the response I just sent in @this thread:
> >>>>> https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba43e2@gmail.com
> >>>>>
> >>>>> And it's not like this issue is unknown. I first posted about it on 2023-10-16.
> >>>>>
> >>>>> Ideally, amdgpu would just fix their init code.
> >>>>
> >>>> You can't make changes to core code that break other drivers.
> >>>> Arguably 56e449603f0ac5 should not have gone in in the first place if
> >>>> it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init
> >>>> code, but as a side effect it seems to have broken suspend for some
> >>>> users.
> >>>>
> >>>> Alex
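
The ordering argument behind the patch quoted above (disable buffer funcs only after the last eviction, but before the SDMA IP is shut down) can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: it is not the amdgpu code, every toy_* name is hypothetical, and the rule that eviction fails outright once buffer funcs are off is a simplification chosen for illustration. The real driver's failure mode may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; NOT the real struct amdgpu_device. */
struct toy_device {
    bool buffer_funcs_enabled; /* accelerated buffer moves available */
    bool sdma_enabled;         /* copy engine still running */
    int  resident_buffers;     /* buffers that still need evicting */
};

static void toy_device_init(struct toy_device *d, int buffers)
{
    d->buffer_funcs_enabled = true;
    d->sdma_enabled = true;
    d->resident_buffers = buffers;
}

/* Assumption of this model: eviction needs buffer funcs and the copy engine. */
static int toy_evict(struct toy_device *d)
{
    if (!d->buffer_funcs_enabled || !d->sdma_enabled)
        return -1;
    d->resident_buffers = 0;
    return 0;
}

/* Order before the patch: buffer funcs are disabled ahead of eviction. */
static int toy_suspend_old_order(struct toy_device *d)
{
    int r;

    d->buffer_funcs_enabled = false;
    r = toy_evict(d);
    if (r)
        return r;
    d->sdma_enabled = false;
    return 0;
}

/* Order after the patch: evict first, then disable buffer funcs, then SDMA. */
static int toy_suspend_new_order(struct toy_device *d)
{
    int r;

    r = toy_evict(d);
    if (r)
        return r;
    d->buffer_funcs_enabled = false;
    d->sdma_enabled = false;
    return 0;
}
```

In this model, calling toy_suspend_old_order() on a freshly initialized device returns an error and leaves the buffers resident, while toy_suspend_new_order() returns 0 with everything evicted and both facilities disabled.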