Date: Mon, 20 Feb 2023 10:51:36 +0000
From: Chris Clayton
To: Ben Skeggs
Cc: Karol Herbst, Linux regressions mailing list, Dave Airlie, bskeggs@redhat.com, Lyude Paul, ML nouveau, LKML, ML dri-devel
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
References: <5abbee70-cc84-1528-c3d8-9befd9edd611@googlemail.com>
 <5cf46df8-0fa2-e9f5-aa8e-7f7f703d96dd@googlemail.com>
 <4e786e22-f17a-da76-5129-8fef0c7c825a@googlemail.com>
 <181bea6a-e501-f5bd-b002-de7a244a921a@googlemail.com>
 <7f6ec5b3-b5c7-f564-003e-132f112b7cf4@googlemail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 20/02/2023 05:35, Ben Skeggs wrote:
> On Sun, 19 Feb 2023 at 04:55, Chris Clayton wrote:
>>
>>
>>
>> On 18/02/2023 15:19, Chris Clayton wrote:
>>>
>>>
>>> On 18/02/2023 12:25, Karol Herbst wrote:
>>>> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 15/02/2023 11:09, Karol Herbst wrote:
>>>>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>>>>> (Thorsten Leemhuis) wrote:
>>>>>>>
>>>>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>>>>> Leemhuis) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>>>>> the root of this problem?
>>>>>>>>>>>>
>>>>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>>>>
>>>>>>>>>>> Many thx for looking into it!
>>>>>>>>>>
>>>>>>>>>> Yes, thanks Karol.
>>>>>>>>>>
>>>>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>>>>
>>>>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> /bin/sync
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> kill $(pidof dmesg)
>>>>>>>>>> /bin/umount /mnt/sda7
>>>>>>>>>>
>>>>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>>>>
>>>>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>>>>>>
>>>>>>>> Thanks Dave. [...]
>>>>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>>>>> truncated. The full post can be found in a new thread:
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@googlemail.com/
>>>>>>>
>>>>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>>>>> really like it, but for regression tracking I'm now putting this on the
>>>>>>> back-burner, as a fix is not in sight.
>>>>>>>
>>>>>>> #regzbot monitor:
>>>>>>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@googlemail.com/
>>>>>>> #regzbot backburner: hard to debug and apparently rare
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>>>>> laptop I got doesn't work on Nouveau because of firmware related
>>>>>> issues and we probably need to get updated ones from Nvidia here :(
>>>>>>
>>>>>> But it's a bit weird that the kernel doesn't shut down, because I don't
>>>>>> see anything in the logs which would prevent that from happening.
>>>>>> Unless it's waiting on one of the tasks to complete, but none of them
>>>>>> looked in any way nouveau related.
>>>>>>
>>>>>> If somebody else has any fancy kernel debugging tips here to figure
>>>>>> out why it hangs, that would be very helpful...
>>>>>>
>>>>>
>>>>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>>>>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>>>>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>>>>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>>>>
>>>>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>>>>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>>>>> partition, and thus the scrubber binary, have become inaccessible.
>>>>>
>>>>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>>>>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>>>>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>>>>> permanent solution.
>>>>>
>>>>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>>>>
>>>>
>>>> Well.. nouveau shouldn't prevent the system from shutting down if the
>>>> firmware file isn't available. Or at least it should print a
>>>> warning/error. Mind messing with the code a little to see if skipping
>>>> it kind of works? I probably can also come up with a patch by next
>>>> week.
>>>>
>>> Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
>>>
>>> int
>>> gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
>>> {
>>> 	nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
>>>
>>> 	if (nvkm_msec(falcon->owner->device, 10,
>>> 		if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
>>> 			break;
>>> 	) < 0)
>>> 		return -ETIMEDOUT;
>>>
>>> 	return 0;
>>> }
>>>
>>> nvkm_msec is #defined to nvkm_usec which in turn is #defined to nvkm_nsec where the loop that the break is related to
>>> appears
>>
>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
>> seconds for a timeout to occur, but it didn't.
> Hey,
>
> Are you able to try the attached patch for me please?
>
> Thanks,
> Ben.
>

Thanks Ben.

Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
offloaded rendering is still working and the discrete GPU is being powered on and off as required.

Thanks.

Reported-by: Chris Clayton
Tested-by: Chris Clayton

>>
>>
>> .> Chris
>>>>>
>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>> --
>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>>>
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>
>>>>
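
For anyone puzzled by the construct quoted above: a `break` can legally appear inside a macro argument when the macro pastes that argument into a loop of its own. The following is a minimal, self-contained sketch of that pattern in GNU C. It is not the nouveau implementation: `POLL_ITERS`, `fake_rd32`, `reads_until_clear` and `wait_bits_clear` are hypothetical names invented for illustration, and a plain iteration count stands in for whatever time source the real macros use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a status register: the busy bits (0x6) stay set
 * for the first `reads_until_clear` reads, then read back as zero. */
static int reads_until_clear;

static uint32_t fake_rd32(void)
{
	if (reads_until_clear > 0) {
		reads_until_clear--;
		return 0x00000006;
	}
	return 0;
}

/*
 * POLL_ITERS(n, cond...) pastes `cond` into a loop owned by the macro, so a
 * `break` written in `cond` leaves that loop. The expression evaluates to the
 * number of completed iterations (>= 0) if `cond` broke out, or -1 if the
 * limit was reached first. Relies on GNU C statement expressions and named
 * variadic macro parameters, so it needs GCC or Clang.
 */
#define POLL_ITERS(n, cond...) ({			\
	int _limit = (n), _taken = 0;			\
	for (;;) {					\
		cond					\
		if (++_taken >= _limit) {		\
			_taken = -1;			\
			break;				\
		}					\
	}						\
	_taken;						\
})

/* Same shape as the quoted function: poll until the busy bits clear. */
static int wait_bits_clear(int max_polls)
{
	assert(max_polls > 0);

	if (POLL_ITERS(max_polls,
		if (!(fake_rd32() & 0x00000006))
			break;
	) < 0)
		return -1;	/* stands in for -ETIMEDOUT */

	return 0;
}
```

With `reads_until_clear = 3`, `wait_bits_clear(10)` breaks out on the fourth poll and returns 0; if the bits never clear within the limit, it returns -1. Because the sketch counts iterations it always terminates, so it says nothing about why the timeout reported in this thread was never observed.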