From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Boris Brezillon, Steven Price
Subject: [PATCH 5.10 633/717] drm/panfrost: Fix job timeout handling
Date: Mon, 28 Dec 2020 13:50:31 +0100
Message-Id: <20201228125051.246020874@linuxfoundation.org>
X-Mailer: git-send-email 2.29.2
In-Reply-To: <20201228125020.963311703@linuxfoundation.org>
References: <20201228125020.963311703@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Boris Brezillon

commit 1a11a88cfd9a97e13be8bc880c4795f9844fbbec upstream.

If more than two jobs end up timing out concurrently, only one of them
(the one attached to the scheduler acquiring the lock) is fully handled.
The other one remains in a dangling state where it's no longer part of
the scheduling queue, but still blocks something in the scheduler,
leading to repetitive timeouts when new jobs are queued.

Let's make sure all bad jobs are properly handled by the thread
acquiring the lock.

v3:
- Add Steven's R-b
- Don't take the sched_lock when stopping the schedulers

v2:
- Fix the subject prefix
- Stop the scheduler before returning from panfrost_job_timedout()
- Call cancel_delayed_work_sync() after drm_sched_stop() to make sure
  no timeout handlers are in flight when we reset the GPU (Steven Price)
- Make sure we release the reset lock before restarting the
  schedulers (Steven Price)

Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
Cc:
Signed-off-by: Boris Brezillon
Reviewed-by: Steven Price
Signed-off-by: Steven Price
Link: https://patchwork.freedesktop.org/patch/msgid/20201002122506.1374183-1-boris.brezillon@collabora.com
Signed-off-by: Greg Kroah-Hartman
---
 drivers/gpu/drm/panfrost/panfrost_job.c |   62 +++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 9 deletions(-)

--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,7 +25,8 @@
 struct panfrost_queue_state {
 	struct drm_gpu_scheduler sched;
-
+	bool stopped;
+	struct mutex lock;
 	u64 fence_context;
 	u64 emit_seqno;
 };
@@ -369,6 +370,24 @@ void panfrost_job_enable_interrupts(stru
 	job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
+static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
+				    struct drm_sched_job *bad)
+{
+	bool stopped = false;
+
+	mutex_lock(&queue->lock);
+	if (!queue->stopped) {
+		drm_sched_stop(&queue->sched, bad);
+		if (bad)
+			drm_sched_increase_karma(bad);
+		queue->stopped = true;
+		stopped = true;
+	}
+	mutex_unlock(&queue->lock);
+
+	return stopped;
+}
+
 static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 {
 	struct panfrost_job *job = to_panfrost_job(sched_job);
@@ -392,19 +411,39 @@ static void panfrost_job_timedout(struct
 		job_read(pfdev, JS_TAIL_LO(js)),
 		sched_job);
 
+	/* Scheduler is already stopped, nothing to do. */
+	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
+		return;
+
 	if (!mutex_trylock(&pfdev->reset_lock))
 		return;
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;
 
-		drm_sched_stop(sched, sched_job);
-		if (js != i)
-			/* Ensure any timeouts on other slots have finished */
+		/*
+		 * If the queue is still active, make sure we wait for any
+		 * pending timeouts.
+		 */
+		if (!pfdev->js->queue[i].stopped)
+			cancel_delayed_work_sync(&sched->work_tdr);
+
+		/*
+		 * If the scheduler was not already stopped, there's a tiny
+		 * chance a timeout has expired just before we stopped it, and
+		 * drm_sched_stop() does not flush pending works. Let's flush
+		 * them now so the timeout handler doesn't get called in the
+		 * middle of a reset.
+		 */
+		if (panfrost_scheduler_stop(&pfdev->js->queue[i], NULL))
 			cancel_delayed_work_sync(&sched->work_tdr);
-	}
 
-	drm_sched_increase_karma(sched_job);
+		/*
+		 * Now that we cancelled the pending timeouts, we can safely
+		 * reset the stopped state.
+		 */
+		pfdev->js->queue[i].stopped = false;
+	}
 
 	spin_lock_irqsave(&pfdev->js->job_lock, flags);
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
@@ -421,11 +460,11 @@ static void panfrost_job_timedout(struct
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
 
+	mutex_unlock(&pfdev->reset_lock);
+
 	/* restart scheduler after GPU is usable again */
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_start(&pfdev->js->queue[i].sched, true);
-
-	mutex_unlock(&pfdev->reset_lock);
 }
 
 static const struct drm_sched_backend_ops panfrost_sched_ops = {
@@ -558,6 +597,7 @@ int panfrost_job_open(struct panfrost_fi
 	int ret, i;
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		mutex_init(&js->queue[i].lock);
 		sched = &js->queue[i].sched;
 		ret = drm_sched_entity_init(&panfrost_priv->sched_entity[i],
 					    DRM_SCHED_PRIORITY_NORMAL, &sched,
@@ -570,10 +610,14 @@ int panfrost_job_open(struct panfrost_fi
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+	struct panfrost_device *pfdev = panfrost_priv->pfdev;
+	struct panfrost_job_slot *js = pfdev->js;
 	int i;
 
-	for (i = 0; i < NUM_JOB_SLOTS; i++)
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+		mutex_destroy(&js->queue[i].lock);
+	}
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
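The core of the fix is the "stop once" guard in panfrost_scheduler_stop(): the stopped flag is tested and set under a per-queue mutex, so when several timeout handlers race on the same queue, exactly one of them gets true back and proceeds with the reset, while the others bail out early. The pattern can be sketched in userspace as follows; queue_state and queue_stop are hypothetical stand-ins (pthread mutex instead of the kernel's struct mutex), and the real function additionally calls drm_sched_stop() and drm_sched_increase_karma() while holding the lock:

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Userspace sketch of the stop-once guard. Hypothetical stand-in for
 * the kernel's panfrost_queue_state / panfrost_scheduler_stop pair.
 */
struct queue_state {
	pthread_mutex_t lock;
	bool stopped;
};

/*
 * Returns true only for the one caller that actually performed the
 * stop. Concurrent callers observe stopped == true under the lock and
 * get false back, so they can return early instead of re-running the
 * reset path on a queue that is already stopped.
 */
bool queue_stop(struct queue_state *queue)
{
	bool stopped = false;

	pthread_mutex_lock(&queue->lock);
	if (!queue->stopped) {
		/* drm_sched_stop(&queue->sched, bad) would go here */
		queue->stopped = true;
		stopped = true;
	}
	pthread_mutex_unlock(&queue->lock);

	return stopped;
}
```

Because the flag is checked and flipped inside the same critical section, two handlers can never both see stopped == false; that single-winner property is what lets panfrost_job_timedout() return immediately when another thread already owns the reset.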