Received: by 2002:a05:7412:40d:b0:e2:908c:2ebd with SMTP id 13csp985428rdf; Wed, 22 Nov 2023 02:23:56 -0800 (PST) X-Google-Smtp-Source: AGHT+IGcZRl200WZW4sAw4KNij65xjJn0hhXH4cx3MSX9EmF1I6m8xhz9F5RHB6ySMlDUaORO5Tb X-Received: by 2002:a17:90b:1b07:b0:280:23e1:e4dd with SMTP id nu7-20020a17090b1b0700b0028023e1e4ddmr3005499pjb.17.1700648635871; Wed, 22 Nov 2023 02:23:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700648635; cv=none; d=google.com; s=arc-20160816; b=FB2jKnONzhglJNB/teaBEfcCVhlOEJyRJtt8cQuMZrDBe67VhnOfF9BsP+jWuBWEeO iLAkhDStH37ur4cChEHBQd3fbwc9H6+BAEzSuspyckmf6JbbTrRMiamWNLb6adS/e3fT ABYiAQkK/I2Ww1CFPuQWHZVouVCcy0/mOksjrjTvlL5pHGW3LhYyfWclrDp7aV4tnmFD HuV473qP+FFc8V59775cxCrwx7DRZ5Xf09KrS8AuSVhto2c+mPcgyw972qwI78BpAJxm 7JFzK66B5V2wL5CgUnRYnjD/qA44Vd/KyGhcvS1zfSHHnso/v4We7MpAx/ptp205Hrjw vHIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=QI46JV9HEbKlBNrANN5C2sRHE87nOwp3S56HcZ5m60I=; fh=rcc4kXo8nOGsDsgYPvksNC02xaDPvEzP9FLPAjDwknE=; b=bnl0X6W/Q2nj2Xhf3Up8rR3TZhrX2lkWoGQvzRalVC3LB0sb6DTATv7LxG7RTMHJb8 4JI6u1XIXBVVbTucRFuaqaEnAkEzwaHkbqz0pR2ZOyQc47KAHlfEFKDo3MgKOwhMMKDp IFdW5UVZM0X3veu/ExUuwd+WkRaHUSH0sv6QBjQLHqKMwdNDYA2tnKbz+3b4+lMqNWTp MHMgVqKEQk0oh6ltnyVd7dPihi8i4sx8Zcde3ylHvhoZLRFpXa7c2M6CiCtjaPuDEfwX /s6tfaojipViCYSLa+ejBtNf7pBKPzbYb4/3zHWrA4LC6C5yz7ZpyRTQAiQbo+eOaA1j Li4A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=X62bx65y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Return-Path: Received: from lipwig.vger.email (lipwig.vger.email. [23.128.96.33]) by mx.google.com with ESMTPS id bf18-20020a17090b0b1200b002775999122csi1168867pjb.141.2023.11.22.02.23.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 22 Nov 2023 02:23:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; Authentication-Results: mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=X62bx65y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id 8A174820D5D9; Wed, 22 Nov 2023 02:23:43 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235334AbjKVKXW (ORCPT + 99 others); Wed, 22 Nov 2023 05:23:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44050 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231407AbjKVKXU (ORCPT ); Wed, 22 Nov 2023 05:23:20 -0500 Received: from madras.collabora.co.uk (madras.collabora.co.uk [IPv6:2a00:1098:0:82:1000:25:2eeb:e5ab]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A5EF112; Wed, 22 Nov 2023 02:23:10 -0800 (PST) Received: from [100.107.97.3] (cola.collaboradmins.com [195.201.22.229]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: kholk11) by madras.collabora.co.uk (Postfix) with ESMTPSA id 72CFC660737B; Wed, 22 Nov 2023 10:23:08 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1700648589; bh=ZOloRXT8TmaeWv8g6UEHGcaeuRiJMBaIUzIimZz8+Zk=; h=Date:Subject:To:Cc:References:From:In-Reply-To:From; b=X62bx65yUFSlXUR+OtecLn/nJCU+rym9ApdgXPY+P8xRJmnmD+5vTG/sxZS+I2fqZ wYxV3JFavBqb2+vDqXEvZF7tkchAUuWT6D5SpE+6fwDCGlNrhcqHlCwdMt9Rfzr+4b rM7lfu0nIkVTgrfsQ4D1iuh8wiJNX7lMf871aCIlDfBwyM8+ywN1ve5ARnTGxyOolJ hiNYwZ14mDFsIabJm82PyCO2D953CAizuBlqm+76tIts9SSOhkvlvOt+WSSsiCZQQn U4OBcvInTLuwFT2LigPVzBuGUSiz7ie0MK8QS/6QYIObLgQZityGkk7EjauUdK1lcW DyzNG9tW7TODw== Message-ID: Date: Wed, 22 Nov 2023 11:23:05 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() Content-Language: en-US To: Boris Brezillon Cc: Krzysztof Kozlowski , Steven Price , tzimmermann@suse.de, linux-kernel@vger.kernel.org, mripard@kernel.org, dri-devel@lists.freedesktop.org, wenst@chromium.org, kernel@collabora.com, "linux-samsung-soc@vger.kernel.org" , Marek Szyprowski References: <20231102141507.73481-1-angelogioacchino.delregno@collabora.com> <7928524a-b581-483b-b1a1-6ffd719ce650@arm.com> <1c9838fb-7f2d-4752-b86a-95bcf504ac2f@linaro.org> <6b7a4669-7aef-41a7-8201-c2cfe401bc43@collabora.com> <20231121175531.085809f5@collabora.com> <4c73f67e-174c-497e-85a5-cb053ce657cb@collabora.com> <20231122105419.69724739@collabora.com> From: AngeloGioacchino Del Regno In-Reply-To: <20231122105419.69724739@collabora.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Wed, 22 Nov 2023 02:23:43 -0800 (PST) Il 22/11/23 10:54, Boris Brezillon ha scritto: > Hi Angelo, > > On Wed, 22 Nov 2023 10:06:19 +0100 > AngeloGioacchino Del Regno > wrote: > >> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: >>> On 21/11/2023 17:55, Boris Brezillon wrote: >>>> On Tue, 21 Nov 2023 17:11:42 +0100 >>>> AngeloGioacchino Del Regno >>>> wrote: >>>> >>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>>>>> On 08/11/2023 14:20, Steven Price wrote: >>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>>>>>>> this means that in order to request poweroff of cores, we are supposed >>>>>>>> to write a bitmask of cores that should be powered off! >>>>>>>> This means that the panfrost_gpu_power_off() function has always been >>>>>>>> doing nothing. >>>>>>>> >>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>>>>>>> to the relevant PWROFF_LO registers and then check that the transition >>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>>>>> registers. >>>>>>>> >>>>>>>> While at it, in order to avoid code duplication, move the core mask >>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>>>>> function, used in both poweron and poweroff. >>>>>>>> >>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>>>>> Signed-off-by: AngeloGioacchino Del Regno >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> This commit was added to next recently but it causes "external abort on >>>>>> non-linefetch" during boot of my Odroid HC1 board. >>>>>> >>>>>> At least bisect points to it. >>>>>> >>>>>> If fixed, please add: >>>>>> >>>>>> Reported-by: Krzysztof Kozlowski >>>>>> >>>>>> [ 4.861683] 8<--- cut here --- >>>>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c >>>>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>>>>> ... >>>>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c >>>>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 >>>>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>>>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 >>>>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 >>>>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>>>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>>>>> >>>>>> Full log: >>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>>>>> >>>>> >>>>> Hey Krzysztof, >>>>> >>>>> This is interesting. It might be about the cores that are missing from the partial >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>>>> see here. >>>> >>>> I've seen such external aborts in the past, and the fault type has >>>> often been misleading. It's unlikely to have anything to do with a >>> >>> Yeah, often accessing device with power or clocks gated. >>> >> >> Except my commit does *not* gate SoC power, nor SoC clocks ???? > > It's not directly related to your commit, it's just a side effect. > Indeed! >> >> What the "Really power off ..." commit does is to ask the GPU to internally power >> off the shaders, tilers and L2, that's why I say that it is strange to see that >> kind of abort. >> >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} >> registers should still be accessible even with shaders, tilers and cache OFF. > > It's not the power_off() call that's problematic, it's when it happens > (the last thing called in panfrost_device_runtime_suspend()), and the > fact it generates interrupts. Yes, you don't explicitly gate the clocks > in panfrost_device_runtime_suspend(), but the PM layer does interact > directly with power domains, and shutting down a power domain might > result in other clks/components being gated, which might make the > register bank inaccessible from the CPU. > >> >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also >> work, but that'd add up quite a bit of latency on the runtime_suspend() call, > > Really? In practice I'd expect no pending interrupts, other than the > power transition ones, which are purely and simply ignored by the > handler. If we had any other pending interrupts on suspend, we would > have faced this problem before. To sum-up, I'd expect the extra latency > to just be the overhead of the synchronize_irq() call, which, after > looking at the code, shouldn't be such a big deal. > >> so >> in this case I'd be more for avoiding to execute any register r/w in the handler >> by either checking if the GPU is supposed to be OFF, > > Yes, that's an option, but I don't think that's enough (see below). > >> or clearing interrupts, > > The handler might have been called already when you clear the > interrupt, and you'd still need to make sure the handler has returned > before returning from panfrost_device_runtime_suspend() if you want to > guarantee no one is touching the registers when the power domains are > shutdown. > >> which >> may not work if those are generated after the execution of the poweroff function. > > They are generated while you poll the register, but that doesn't > guarantee they will be processed by the time you return from your > power_off() function, which I think is exactly the problem we're facing > here. > >> Or we could simply disable the irq after power_off, but that'd be hacky (as well). > > If by disabling the interrupt you mean calling disable_irq(), that > would work if the irq lines were not declared as shared (IRQF_SHARED > flag passed at request time). Beside, the latency of disable_irq() > should be pretty much the same as synchronize_irq(), given > synchronize_irq() from there. > > If by disabling the interrupt, you mean masking it with _INT_MASK, > then, as said above, that's not enough. You need to make sure any > handler that may have been called as a result of this interrupt, > returns before you return from the suspend function, so you need some > kind of synchronization. > Your reasons are totally valid and I see the point. That's what I'll do as a follow-up Fixes patch: - gpu_write(pfdev, GPU_INT_MASK, 0); - gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL); - synchronize_irq() - poweroff *all* shaders/tilers/l2 (without caring about core_mask) - *No* INT_MASK restore, as we rely on soft_reset() to do that for us once we resume the GPU. >> >> >> Let's see if asking to poweroff *everything* works: > > It might slightly change the timing, making this problem disappear by > chance (if the interrupt gets processed before power_off() returns), > but it doesn't make the suspend logic more robust. We really have to > guarantee that no one will touch the registers when we enter suspend, > be it some interrupt handler, or any kind of deferred work. > > Again, none of this is a direct result of your patch, it's just that > your patch uncovered the problem, and I think now is a good time to fix > it properly. > Yes, I am well aware of that and I was trying to make that clear in the first place - I'm sorry if I gave the impression of having any kind of doubt around that, or any other. Cheers! Angelo