From: Sasha Levin
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Philip Yang, Ruili Ji, Felix Kuehling, Alex Deucher, Sasha Levin,
    christian.koenig@amd.com, Xinhui.Pan@amd.com, airlied@linux.ie,
    daniel@ffwll.ch, amd-gfx@lists.freedesktop.org,
    dri-devel@lists.freedesktop.org
Subject: [PATCH AUTOSEL 5.17 023/149] drm/amdkfd: svm range restore work deadlock when process exit
Date: Fri, 1 Apr 2022 10:23:30 -0400
Message-Id: <20220401142536.1948161-23-sashal@kernel.org>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220401142536.1948161-1-sashal@kernel.org>
References: <20220401142536.1948161-1-sashal@kernel.org>
X-stable: review
X-Mailing-List: linux-kernel@vger.kernel.org

From: Philip Yang

[ Upstream commit 6225bb3a88d22594aacea2485dc28ca12d596721 ]

kfd_process_notifier_release flushes svm_range_restore_work, which calls
svm_range_list_lock_and_flush_work to flush the deferred_list work. If
the mmput in the deferred_list work releases the last user, it calls
exit_mmap -> notifier_release, which deadlocks with the backtrace below.

Move the flush of svm_range_restore_work to kfd_process_wq_release to
avoid the deadlock. svm_range_restore_work then takes a task->mm
reference so that the mm cannot go away while validating and mapping
ranges to the GPU.

Workqueue: events svm_range_deferred_list_work [amdgpu]
Call Trace:
 wait_for_completion+0x94/0x100
 __flush_work+0x12a/0x1e0
 __cancel_work_timer+0x10e/0x190
 cancel_delayed_work_sync+0x13/0x20
 kfd_process_notifier_release+0x98/0x2a0 [amdgpu]
 __mmu_notifier_release+0x74/0x1f0
 exit_mmap+0x170/0x200
 mmput+0x5d/0x130
 svm_range_deferred_list_work+0x104/0x230 [amdgpu]
 process_one_work+0x220/0x3c0

Signed-off-by: Philip Yang
Reported-by: Ruili Ji
Tested-by: Ruili Ji
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 15 +++++++++------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index d1145da5348f..74f162887d3b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1150,7 +1150,6 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
 
 	cancel_delayed_work_sync(&p->eviction_work);
 	cancel_delayed_work_sync(&p->restore_work);
-	cancel_delayed_work_sync(&p->svms.restore_work);
 
 	mutex_lock(&p->mutex);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 225affcddbc1..1cf9041c9727 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1643,13 +1643,14 @@ static void svm_range_restore_work(struct work_struct *work)
 
 	pr_debug("restore svm ranges\n");
 
-	/* kfd_process_notifier_release destroys this worker thread. So during
-	 * the lifetime of this thread, kfd_process and mm will be valid.
-	 */
 	p = container_of(svms, struct kfd_process, svms);
-	mm = p->mm;
-	if (!mm)
+
+	/* Keep mm reference when svm_range_validate_and_map ranges */
+	mm = get_task_mm(p->lead_thread);
+	if (!mm) {
+		pr_debug("svms 0x%p process mm gone\n", svms);
 		return;
+	}
 
 	svm_range_list_lock_and_flush_work(svms, mm);
 	mutex_lock(&svms->lock);
@@ -1703,6 +1704,7 @@ static void svm_range_restore_work(struct work_struct *work)
 out_reschedule:
 	mutex_unlock(&svms->lock);
 	mmap_write_unlock(mm);
+	mmput(mm);
 
 	/* If validation failed, reschedule another attempt */
 	if (evicted_ranges) {
@@ -2840,6 +2842,8 @@ void svm_range_list_fini(struct kfd_process *p)
 
 	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
 
+	cancel_delayed_work_sync(&p->svms.restore_work);
+
 	/* Ensure list work is finished before process is destroyed */
 	flush_work(&p->svms.deferred_list_work);
 
@@ -2850,7 +2854,6 @@ void svm_range_list_fini(struct kfd_process *p)
 	atomic_inc(&p->svms.drain_pagefaults);
 	svm_range_drain_retry_fault(&p->svms);
 
-
 	list_for_each_entry_safe(prange, next, &p->svms.list, list) {
 		svm_range_unlink(prange);
 		svm_range_remove_notifier(prange);
-- 
2.34.1