Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp5099440imu; Tue, 8 Jan 2019 11:29:02 -0800 (PST) X-Google-Smtp-Source: ALg8bN6ylnxBp/4GOk4pV0P+mXM2/WjbvdVMOH2jBeRCeacLQR+JmHlQGu9PupSrzvjP4Ix9unDl X-Received: by 2002:a62:1b83:: with SMTP id b125mr3124424pfb.42.1546975742795; Tue, 08 Jan 2019 11:29:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1546975742; cv=none; d=google.com; s=arc-20160816; b=mvUqu81NXgLDJpUbjcvuETb+Uqd9W1PPTZ2FA7rSgNa9AuTowv1i1I+UMqAfTJwC/R 3rkj6/1jErpaDKcDwOldI09kh13S7P8s7rcwD9Z2uFOOx0W9VGspJeZapacwom3ItXLu r7Oe09XKGr3UrqGqHxiljhDOYFQW4CwbEMhf3J6FDHGLFUSXX40AkoFWNSJcSh3yxs2v gtXU5xfZxU0dKQoc72KRcEtg9IsJxB8XsjzBvftJ87lol5uNyvdzxLA9pq7PxbHWXHGv lJ3oF0miZAD0oNwjk1yZ/2viZ8I3nSRYXnCHBLuvpjP8nXprQLk2deNit+faw8pzjWx5 o4bA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=iy1zhImX9KX0JxswmE57/t/TaP7bzLURj9gu7Jdmur4=; b=oQS0ndSRaCYsIqR3K7nC/1BA6MAKrNlvJCtCcwBdlArsba96qXk7TofLq3FqvP72VT SejQO/T6CxKfe8zWinWhEk/vxMBuonmRRV49gT1wyuCa3Z5ZxUKpJn3SjIV3whrwslSF Fi6t9VYKUJBSV/BD2ubG6wqh9XT44d0rEfMjSOyvSq91rjeg332DmQthyT8GYfANEjKx lcsmlb+uIwQr+x4pENWDv7JWOm+nKh2b75oSHExdI+KqcehemSP2dXNo3AX1iYRqfVZu ude4lul8kZsULrVJ7lNL61fFrsySVKzASvtpzUPnLofQly1Vzgkt0FYwKKJmQHs6+F7C Hm0w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=utDM7yhe; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id g23si66263471pgb.229.2019.01.08.11.28.47; Tue, 08 Jan 2019 11:29:02 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=utDM7yhe; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729528AbfAHT1P (ORCPT + 99 others); Tue, 8 Jan 2019 14:27:15 -0500 Received: from mail.kernel.org ([198.145.29.99]:33090 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729472AbfAHT1J (ORCPT ); Tue, 8 Jan 2019 14:27:09 -0500 Received: from sasha-vm.mshome.net (c-73-47-72-35.hsd1.nh.comcast.net [73.47.72.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 0E3612070B; Tue, 8 Jan 2019 19:27:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1546975627; bh=sxJa0TtH59/TOip/7ahu5zdJbR7StEg2U7CHiTf7gvE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=utDM7yheD6YKyWTEr/oDtV80VRaYCZdsR+iUnc0Uf7Iik3COYus5TPkc10WpxJJQ2 nJaAFmf4oA1oRcsCDo8J0+Mq1efNvYP6Jmq9HUToLYqAQHe1lbQHp+3Vk3hqkuw0p8 Fzi2tuko/NEBBfy7XuBgVfpt5upsK8/TYSqw1JLc= From: Sasha Levin To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Trigger Huang , Alex Deucher , Sasha Levin , dri-devel@lists.freedesktop.org Subject: [PATCH AUTOSEL 4.20 021/117] drm/scheduler: Fix bad job be re-processed in TDR Date: Tue, 8 Jan 2019 14:24:49 -0500 Message-Id: <20190108192628.121270-21-sashal@kernel.org> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190108192628.121270-1-sashal@kernel.org> References: <20190108192628.121270-1-sashal@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-Patchwork-Hint: Ignore Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Trigger Huang [ Upstream commit 85744e9c100696d3f210e80b85fd56dd19767c81 ] A bad job is the one triggered TDR(In the current amdgpu's implementation, actually all the jobs in the current joq-queue will be treated as bad jobs). In the recovery process, its fence will be fake signaled and as a result, the work behind will be scheduled to delete it from the mirror list, but if the TDR process is invoked before the work's execution, then this bad job might be processed again and the call dma_fence_set_error to its fence in TDR process will lead to kernel warning trace: [ 143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched] kernel: [ 143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi [ 143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G OE 4.15.0-20-generic #21-Ubuntu [ 143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 143.033653] Workqueue: events drm_sched_job_timedout [amd_sched] [ 143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched] [ 143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202 [ 143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08 [ 143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18 [ 143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6 [ 143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430 [ 143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00 [ 143.033663] FS: 0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000 [ 143.033664] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0 [ 143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 143.033670] Call Trace: [ 143.033744] amdgpu_device_gpu_recover+0x144/0x820 [amdgpu] [ 143.033788] amdgpu_job_timedout+0x9b/0xa0 [amdgpu] [ 143.033791] drm_sched_job_timedout+0xcc/0x150 [amd_sched] [ 143.033795] process_one_work+0x1de/0x410 [ 143.033797] worker_thread+0x32/0x410 [ 143.033799] kthread+0x121/0x140 [ 143.033801] ? process_one_work+0x410/0x410 [ 143.033803] ? kthread_create_worker_on_cpu+0x70/0x70 [ 143.033806] ret_from_fork+0x35/0x40 So just delete the bad job from mirror list directly Changes in v3: - Add a helper function to delete the bad jobs from mirror list and call it directly *before* the job's fence is signaled Changes in v2: - delete the useless list node check - also delete bad jobs in drm_sched_main because: kthread_unpark(ring->sched.thread) will be invoked very early before amdgpu_device_gpu_recover's return, then drm_sched_main will have chance to pick up a new job from the job queue. This new job will be added into the mirror list and processed by amdgpu_job_run, but may not be deleted from the mirror list on time due to the same reason. And finally re-processed by drm_sched_job_recovery Signed-off-by: Trigger Huang Reviewed-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin --- drivers/gpu/drm/scheduler/sched_main.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 44fe587aaef9..c5bbbd7cb2de 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -60,6 +60,8 @@ static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb); +static void drm_sched_expel_job_unlocked(struct drm_sched_job *s_job); + /** * drm_sched_rq_init - initialize a given run queue struct * @@ -215,7 +217,7 @@ static void drm_sched_job_finish(struct work_struct *work) spin_lock(&sched->job_list_lock); /* remove job from ring_mirror_list */ - list_del(&s_job->node); + list_del_init(&s_job->node); /* queue TDR for next job */ drm_sched_start_timeout(sched); spin_unlock(&sched->job_list_lock); @@ -378,6 +380,8 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched) r); dma_fence_put(fence); } else { + if (s_fence->finished.error < 0) + drm_sched_expel_job_unlocked(s_job); drm_sched_process_job(NULL, &s_fence->cb); } spin_lock(&sched->job_list_lock); @@ -567,6 +571,8 @@ static int drm_sched_main(void *param) r); dma_fence_put(fence); } else { + if (s_fence->finished.error < 0) + drm_sched_expel_job_unlocked(sched_job); drm_sched_process_job(NULL, &s_fence->cb); } @@ -575,6 +581,15 @@ static int drm_sched_main(void *param) return 0; } +static void drm_sched_expel_job_unlocked(struct drm_sched_job *s_job) +{ + struct drm_gpu_scheduler *sched = s_job->sched; + + spin_lock(&sched->job_list_lock); + list_del_init(&s_job->node); + spin_unlock(&sched->job_list_lock); +} + /** * drm_sched_init - Init a gpu scheduler instance * -- 2.19.1