Message-ID: <31f04dca-f7b7-e899-07ee-8c5f2dd55494@arm.com>
Date: Mon, 17 Jul 2023 09:59:28 +0100
Subject: Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
To: Boris Brezillon
Cc: Dmitry Osipenko, Rob Herring, dri-devel@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, kernel@collabora.com
References: <20230717065254.1061033-1-dmitry.osipenko@collabora.com>
 <20230717090506.2ded4594@collabora.com>
 <80de081a-e443-85a2-1a61-6a8885e8d529@collabora.com>
 <20230717094905.7a1ee007@collabora.com>
 <0b527996-342b-da44-61dd-38743db80cda@arm.com>
 <20230717104955.268d84a8@collabora.com>
From: Steven Price
In-Reply-To: <20230717104955.268d84a8@collabora.com>

On 17/07/2023 09:49, Boris Brezillon wrote:
> On Mon, 17 Jul 2023 09:06:56 +0100
> Steven Price wrote:
>
>> On 17/07/2023 08:49, Boris Brezillon wrote:
>>> On Mon, 17 Jul 2023 10:20:02 +0300
>>> Dmitry Osipenko wrote:
>>>
>>>> Hi,
>>>>
>>>> On 7/17/23 10:05, Boris Brezillon wrote:
>>>>> Hi Dmitry,
>>>>>
>>>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>>>> Dmitry Osipenko wrote:
>>>>>
>>>>>> The Panfrost IRQ handler may get stuck for a long time, for example
>>>>>> when there is a bad HDMI connection and the HDMI handler takes a
>>>>>> long time to finish processing, holding up Panfrost. Make Panfrost's
>>>>>> job timeout handler sync the IRQ before checking the fence signal
>>>>>> status, in order to prevent spurious job timeouts due to slow IRQ
>>>>>> processing.
>>>>>
>>>>> Feels like the problem should be fixed in the HDMI encoder driver
>>>>> instead, so it doesn't stall the whole system when processing its
>>>>> IRQs (use threaded IRQs, maybe). I honestly don't think blocking in
>>>>> the job timeout path to flush IRQs is a good strategy.
>>>>
>>>> The syncing is necessary for correctness regardless of whether it's
>>>> an HDMI problem or something else; there could be other reasons for
>>>> the CPU to delay IRQ processing. It's wrong to say that the hw is
>>>> hung while it's not.
>>>
>>> Well, the hardware is effectively hung, if not indefinitely, at least
>>> temporarily. All you do here is block in the timeout handler path
>>> waiting for the GPU interrupt handlers to finish - handlers that are
>>> probably waiting in the queue because the raw HDMI handler is
>>> blocking them somehow. So, in the end, you might just be delaying the
>>> HWR a bit more. I know it's not the GPU's fault in that case, and the
>>> job could have finished in time if the HDMI encoder hadn't stalled
>>> the interrupt handling pipeline, but I'm not sure we should care
>>> about that specific situation. And more importantly, if it took more
>>> than 500ms to get a frame rendered (or, in that case, to get the
>>> event that a frame was rendered), you already lost, so I'm not sure
>>> correctness matters: rendering didn't make it in time, and the
>>> watchdog kicked in to try and unblock the situation. Feels like we're
>>> just papering over an HDMI encoder driver bug here, really.
>>
>> TLDR; I don't see any major downsides, and it stops the GPU getting
>> the blame for something that isn't its fault.
>
> True, but doing that will also give the impression that things run
> fine, but very slowly, which would put the blame on the userspace
> driver :P.

Maybe I'm tainted by years of the kernel driver getting the blame
because it was the one that printed the message ;p

>> I guess the question is whether panfrost should work on a system
>> which has terrible IRQ latency. At the moment we have a
>> synchronize_irq() call in panfrost_reset() which effectively does the
>> same thing, but with all the overhead/spew of resetting the GPU.
>
> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
> mostly here to make sure there's no race between the interrupt
> handler and the reset logic (we mask interrupts and then synchronize,
> guaranteeing that the interrupt handler won't be running after that
> point), and it happens after we've printed the error message, so the
> user knows something was blocked at least.

Yes, the synchronize_irq() in panfrost_reset() is there to avoid a real
race - but it has the side effect of flushing out the IRQ, and
therefore the job gets completed successfully. And in the high IRQ
latency situation it makes the actual reset redundant.

>> Of course in the case Dmitry is actually talking about - it does seem
>> like the HDMI encoder has a bug which needs fixing. There are plenty
>> of other things that will break if IRQ latency gets that bad.
>
> Yes, that's my point. The GPU driver is the only one to complain right
> now, but the HDMI encoder behavior could be impacting other parts of
> the system. Silently ignoring those weirdnesses sounds like a terrible
> idea.

Agreed - but making it look like a GPU driver bug isn't good either.
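For reference, the mask-then-synchronize pattern Boris describes looks
roughly like this - a simplified sketch rather than the actual
panfrost_reset() body (the function name here is made up, and the error
reporting plus the reset itself are elided):

static void panfrost_job_irq_quiesce(struct panfrost_device *pfdev)
{
	/* Mask job interrupts so no new handler invocation can start. */
	job_write(pfdev, JOB_INT_MASK, 0);

	/*
	 * Wait for any handler that is already in flight to complete.
	 * After this returns, the job IRQ handler is guaranteed not to
	 * be running - and, per the discussion above, a late "job done"
	 * IRQ gets flushed out as a side effect, signalling the
	 * done_fence.
	 */
	synchronize_irq(pfdev->js->irq);

	/* ... error reporting and the actual GPU reset follow ... */
}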
>>
>> I do wonder if it makes sense to only synchronize when it's needed,
>> e.g.:
>>
>> ----8<---
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index dbc597ab46fb..d96266b74e5c 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>  	if (dma_fence_is_signaled(job->done_fence))
>>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>>  
>> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
>> +	synchronize_irq(pfdev->js->irq);
>> +	/* Recheck if the job is now complete */
>> +	if (dma_fence_is_signaled(job->done_fence))
>> +		return DRM_GPU_SCHED_STAT_NOMINAL;
>> +
>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>  		js,
>>  		job_read(pfdev, JS_CONFIG(js)),
>> ----8<---
>>
>> I don't have any data as to how often we hit the case where the DRM
>> scheduler calls the timeout but we've already signalled - so the
>> extra check might be overkill.
>
> Right, it's not so much about the overhead of the synchronize_irq()
> call (even though my first reply complained about that :-)), but more
> about silently ignoring system misbehaviors. So I guess I'd be fine
> with a version printing a dev_warn("Unexpectedly high interrupt
> latency") when synchronize_irq() unblocks the situation, which means
> you'd still have to do it in two steps.

I like this idea - it still warns so it's obvious there's something
wrong with the system, and it makes it clear it's not the GPU's fault.

Steve
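For completeness, a minimal sketch of the two-step version being
discussed, combining the diff above with Boris's dev_warn() suggestion
(the warning text and the elided tail of the function are illustrative,
not a final patch):

static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job *sched_job)
{
	struct panfrost_job *job = to_panfrost_job(sched_job);
	struct panfrost_device *pfdev = job->pfdev;
	int js = panfrost_job_get_slot(job);

	/* Step 1: the job may have genuinely completed already. */
	if (dma_fence_is_signaled(job->done_fence))
		return DRM_GPU_SCHED_STAT_NOMINAL;

	/*
	 * Step 2: flush the job IRQ and recheck. If the fence is
	 * signalled now, the GPU finished on time and only the interrupt
	 * delivery was late - warn about the system rather than blaming
	 * the GPU.
	 */
	synchronize_irq(pfdev->js->irq);
	if (dma_fence_is_signaled(job->done_fence)) {
		dev_warn(pfdev->dev,
			 "unexpectedly high interrupt latency, js=%d\n", js);
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}

	/* Otherwise it really is a GPU timeout: report and reset. */
	dev_err(pfdev->dev, "gpu sched timeout, js=%d, ...", js);
	/* ... existing reset path unchanged ... */
	return DRM_GPU_SCHED_STAT_NOMINAL;
}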