From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Philip Yang, Ruili Ji, Felix Kuehling, Alex Deucher, Sasha Levin
Subject: [PATCH 5.17 032/343] drm/amdkfd: svm range restore work deadlock when process exit
Date: Tue, 12 Apr 2022 08:27:30 +0200
Message-Id: <20220412062952.033715124@linuxfoundation.org>
In-Reply-To: <20220412062951.095765152@linuxfoundation.org>
References: <20220412062951.095765152@linuxfoundation.org>

From: Philip Yang

[ Upstream commit 6225bb3a88d22594aacea2485dc28ca12d596721 ]

kfd_process_notifier_release flushes svm_range_restore_work, which calls
svm_range_list_lock_and_flush_work to flush the deferred_list work. But if
the mmput in the deferred_list work releases the last user, it calls
exit_mmap -> notifier_release, which deadlocks with the backtrace below.

Move the flush of svm_range_restore_work to kfd_process_wq_release to
avoid the deadlock. svm_range_restore_work then takes a task->mm reference
so that the mm cannot go away while ranges are being validated and mapped
to the GPU.

Workqueue: events svm_range_deferred_list_work [amdgpu]
Call Trace:
 wait_for_completion+0x94/0x100
 __flush_work+0x12a/0x1e0
 __cancel_work_timer+0x10e/0x190
 cancel_delayed_work_sync+0x13/0x20
 kfd_process_notifier_release+0x98/0x2a0 [amdgpu]
 __mmu_notifier_release+0x74/0x1f0
 exit_mmap+0x170/0x200
 mmput+0x5d/0x130
 svm_range_deferred_list_work+0x104/0x230 [amdgpu]
 process_one_work+0x220/0x3c0

Signed-off-by: Philip Yang
Reported-by: Ruili Ji
Tested-by: Ruili Ji
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 15 +++++++++------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index d1145da5348f..74f162887d3b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1150,7 +1150,6 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
 
 	cancel_delayed_work_sync(&p->eviction_work);
 	cancel_delayed_work_sync(&p->restore_work);
-	cancel_delayed_work_sync(&p->svms.restore_work);
 
 	mutex_lock(&p->mutex);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 225affcddbc1..1cf9041c9727 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1643,13 +1643,14 @@ static void svm_range_restore_work(struct work_struct *work)
 
 	pr_debug("restore svm ranges\n");
 
-	/* kfd_process_notifier_release destroys this worker thread. So during
-	 * the lifetime of this thread, kfd_process and mm will be valid.
-	 */
 	p = container_of(svms, struct kfd_process, svms);
-	mm = p->mm;
-	if (!mm)
+
+	/* Keep mm reference when svm_range_validate_and_map ranges */
+	mm = get_task_mm(p->lead_thread);
+	if (!mm) {
+		pr_debug("svms 0x%p process mm gone\n", svms);
 		return;
+	}
 
 	svm_range_list_lock_and_flush_work(svms, mm);
 	mutex_lock(&svms->lock);
@@ -1703,6 +1704,7 @@ static void svm_range_restore_work(struct work_struct *work)
 out_reschedule:
 	mutex_unlock(&svms->lock);
 	mmap_write_unlock(mm);
+	mmput(mm);
 
 	/* If validation failed, reschedule another attempt */
 	if (evicted_ranges) {
@@ -2840,6 +2842,8 @@ void svm_range_list_fini(struct kfd_process *p)
 
 	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
 
+	cancel_delayed_work_sync(&p->svms.restore_work);
+
 	/* Ensure list work is finished before process is destroyed */
 	flush_work(&p->svms.deferred_list_work);
 
@@ -2850,7 +2854,6 @@ void svm_range_list_fini(struct kfd_process *p)
 	atomic_inc(&p->svms.drain_pagefaults);
 	svm_range_drain_retry_fault(&p->svms);
 
-
 	list_for_each_entry_safe(prange, next, &p->svms.list, list) {
 		svm_range_unlink(prange);
 		svm_range_remove_notifier(prange);
-- 
2.35.1
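
Note for readers: the patch makes svm_range_restore_work hold its own mm reference for the duration of the work, instead of relying on the notifier release path to cancel the worker first. The snippet below is a minimal userspace sketch of that pin/bail-out/unpin shape, not kernel code. Every name in it (fake_mm, fake_mm_get, fake_mm_put, fake_restore_work) is hypothetical, and the "take a reference only while the count is non-zero" helper is a simplification. The real get_task_mm() also takes the task lock and checks for kernel threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct: only the user count matters. */
struct fake_mm {
	atomic_int users;	/* loosely analogous to mm_users */
	bool torn_down;		/* set once the last user is dropped */
};

/* Take a reference only if the object is still alive (count > 0). */
static struct fake_mm *fake_mm_get(struct fake_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old > 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return mm;
	}
	return NULL;
}

/* Drop a reference; the last put tears the object down. */
static void fake_mm_put(struct fake_mm *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) == 1)
		mm->torn_down = true;
}

/*
 * Model of the restore worker after the patch: pin the mm for the whole
 * body, bail out if it is already gone, and unpin on the way out.
 * Returns true if the work body ran.
 */
static bool fake_restore_work(struct fake_mm *task_mm)
{
	struct fake_mm *mm = fake_mm_get(task_mm);

	if (!mm)
		return false;	/* process mm gone, nothing to restore */

	/* ... validate and map ranges while the mm is pinned ... */
	assert(!mm->torn_down);

	fake_mm_put(mm);
	return true;
}
```

In this model, calling fake_restore_work() on an object that still has a user runs the body and leaves the count unchanged afterwards, while calling it after the last fake_mm_put() returns early. That mirrors the new "process mm gone" branch in the diff.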