Date: Mon, 17 Jul 2023 16:30:51 +0300
From: Dmitry Osipenko
To: Steven Price, Boris Brezillon
Cc: Rob Herring, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, kernel@collabora.com
Subject: Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
In-Reply-To: <31f04dca-f7b7-e899-07ee-8c5f2dd55494@arm.com>
References: <20230717065254.1061033-1-dmitry.osipenko@collabora.com>
 <20230717090506.2ded4594@collabora.com>
 <80de081a-e443-85a2-1a61-6a8885e8d529@collabora.com>
 <20230717094905.7a1ee007@collabora.com>
 <0b527996-342b-da44-61dd-38743db80cda@arm.com>
 <20230717104955.268d84a8@collabora.com>
 <31f04dca-f7b7-e899-07ee-8c5f2dd55494@arm.com>

17.07.2023 11:59, Steven Price wrote:
> On 17/07/2023 09:49, Boris Brezillon wrote:
>> On Mon, 17 Jul 2023 09:06:56 +0100
>> Steven Price wrote:
>>
>>> On 17/07/2023 08:49, Boris Brezillon wrote:
>>>> On Mon, 17 Jul 2023 10:20:02 +0300
>>>> Dmitry Osipenko wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 7/17/23 10:05, Boris Brezillon wrote:
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>>>>> Dmitry Osipenko wrote:
>>>>>>
>>>>>>> The Panfrost IRQ handler may get stuck for a long time; for example,
>>>>>>> this happens when there is a bad HDMI connection and the HDMI handler
>>>>>>> takes a long time to finish its processing, holding up Panfrost. Make
>>>>>>> Panfrost's job timeout handler sync the IRQ before checking the fence
>>>>>>> signal status, in order to prevent spurious job timeouts caused by
>>>>>>> slow IRQ processing.
>>>>>>
>>>>>> Feels like the problem should be fixed in the HDMI encoder driver
>>>>>> instead, so it doesn't stall the whole system when processing its
>>>>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in
>>>>>> the job timeout path to flush IRQs is a good strategy.
>>>>>
>>>>> The syncing is necessary for correctness regardless of whether it's an
>>>>> HDMI problem or something else; there could be other reasons for the
>>>>> CPU to delay IRQ processing. It's wrong to say that the hw is hung
>>>>> while it's not.
>>>>
>>>> Well, the hardware is effectively hung, if not indefinitely, at least
>>>> temporarily. All you do here is block in the timeout handler path,
>>>> waiting for the GPU interrupt handler to finish - a handler that's
>>>> probably waiting in the queue because the raw HDMI handler is blocking
>>>> it somehow. So, in the end, you might just be delaying the HWR a bit
>>>> more. I know it's not the GPU's fault in that case, and the job could
>>>> have finished in time if the HDMI encoder hadn't stalled the interrupt
>>>> handling pipeline, but I'm not sure we should care for that specific
>>>> situation. And more importantly, if it took more than 500ms to get a
>>>> frame rendered (or, in this case, to get the event that a frame was
>>>> rendered), you already lost, so I'm not sure correctness matters:
>>>> rendering didn't make it in time, and the watchdog kicked in to try and
>>>> unblock the situation. Feels like we're just papering over an HDMI
>>>> encoder driver bug here, really.
>>>
>>> TL;DR: I don't see any major downsides, and it stops the GPU getting
>>> the blame for something that isn't its fault.
>>
>> True, but doing that will also give the impression that things run
>> fine, but very slowly, which would put the blame on the userspace
>> driver :P.
>
> Maybe I'm tainted by years of the kernel driver getting the blame
> because it was the one that printed the message ;p
>
>>>
>>> I guess the question is whether panfrost should work on a system which
>>> has terrible IRQ latency. At the moment we have a synchronize_irq()
>>> call in panfrost_reset() which effectively does the same thing, but
>>> with all the overhead/spew of resetting the GPU.
>>
>> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
>> mostly there to make sure there's no race between the interrupt
>> handler and the reset logic (we mask interrupts and then synchronize,
>> guaranteeing that the interrupt handler won't be running after that
>> point), and it happens after we've printed the error message, so the
>> user at least knows something was blocked.
>
> Yes, the synchronize_irq() in panfrost_reset() is there to avoid a real
> race - but it has the side effect of flushing out the IRQ, and therefore
> the job gets completed successfully. And in the high-IRQ-latency
> situation that makes the actual reset redundant.
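
For context, the mask-then-synchronize pattern Boris describes looks
roughly like this (paraphrased from panfrost_job.c and simplified here;
not the exact upstream code):

----8<---
	/*
	 * Mask job interrupts, then wait for any handler instance that
	 * is currently running to finish. After this point the job IRQ
	 * handler is guaranteed not to run until interrupts are
	 * unmasked again.
	 */
	job_write(pfdev, JOB_INT_MASK, 0);
	synchronize_irq(pfdev->js->irq);
----8<---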
>>>
>>> Of course, in the case Dmitry is actually talking about, it does seem
>>> like the HDMI encoder has a bug which needs fixing. There are plenty
>>> of other things that will break if IRQ latency gets that bad.
>>
>> Yes, that's my point. The GPU driver is the only one to complain right
>> now, but the HDMI encoder behavior could be impacting other parts of
>> the system. Silently ignoring those weirdnesses sounds like a terrible
>> idea.
>
> Agreed - but making it look like a GPU driver bug isn't good either.
>
>>>
>>> I do wonder if it makes sense to only synchronize when it's needed,
>>> e.g.:
>>>
>>> ----8<---
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> index dbc597ab46fb..d96266b74e5c 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>>  	if (dma_fence_is_signaled(job->done_fence))
>>>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>>>
>>> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
>>> +	synchronize_irq(pfdev->js->irq);
>>> +	/* Recheck if the job is now complete */
>>> +	if (dma_fence_is_signaled(job->done_fence))
>>> +		return DRM_GPU_SCHED_STAT_NOMINAL;
>>> +
>>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>>  		js,
>>>  		job_read(pfdev, JS_CONFIG(js)),
>>> ----8<---
>>>
>>> I don't have any data as to how often we hit the case where the DRM
>>> scheduler calls the timeout but we've already signalled - so the
>>> extra check might be overkill.
>>
>> Right, it's not so much about the overhead of the synchronize_irq()
>> call (even though my first reply complained about that :-)), but more
>> about silently ignoring system misbehaviors. So I guess I'd be fine
>> with a version printing a dev_warn("Unexpectedly high interrupt
>> latency") when synchronize_irq() unblocks the situation, which means
>> you'd still have to do it in two steps.
>
> I like this idea - it still warns, so it's obvious there's something
> wrong with the system, and it makes it clear it's not the GPU's fault.

Like that idea too, thanks for the suggestions! Will prepare v2 along
these lines (rough sketch below).

-- 
Best regards,
Dmitry
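
P.S. A rough, untested sketch of the two-step variant discussed above,
combining Steven's hunk with Boris' dev_warn() suggestion (the exact
warning text and placement may still change in v2):

----8<---
	if (dma_fence_is_signaled(job->done_fence))
		return DRM_GPU_SCHED_STAT_NOMINAL;

	/*
	 * The job may have completed while its IRQ is still pending,
	 * e.g. because another driver's slow handler is blocking the
	 * interrupt handling pipeline. Flush the handler and re-check
	 * before declaring the GPU hung.
	 */
	synchronize_irq(pfdev->js->irq);

	if (dma_fence_is_signaled(job->done_fence)) {
		dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}
----8<---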