From: Daniel Gomez
Date: Wed, 3 Feb 2021 15:54:36 +0100
Subject: Re: [amdgpu] deadlock
To: Christian König
Cc: amd-gfx list, dri-devel, Linux Kernel Mailing List, Alex Deucher
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 3 Feb 2021 at 15:37, Christian König wrote:
>
> Hi Daniel,
>
> I've talked a bit with our internal team.
>
> The problem is that the 20.20 release still uses the older OpenCL stack
> which obviously has a bug here and causes a hang.
>
> The best approach I can give you is to switch to the ROCm stack instead.

Thanks Christian. I'll try with the ROCm stack then. As far as I understood,
it should work because the part of the code where it now hangs is not actually
used by the ROCm stack, is that correct? However, the hang/bug will still be
there even though it is not used in that stack.

Anyway, I'll keep you guys posted with this change.

>
> Regards,
> Christian.
>
> On 03.02.21 at 09:33, Daniel Gomez wrote:
> > Hi all,
> >
> > I have a deadlock with the amdgpu mainline driver when running two
> > OpenCL applications in parallel. So far, we've been able to replicate it
> > easily by executing clinfo and MatrixMultiplication (from AMD
> > opencl-samples). The opencl-samples are quite old, so if you have any
> > other suggestion for testing I'd be very happy to test it as well.
> >
> > How to replicate the issue:
> >
> > # while true; do /usr/bin/MatrixMultiplication --device gpu \
> >     --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > # while true; do clinfo; done
> >
> > Output:
> >
> > After a minute or less (sometimes it can take longer) I can see that
> > MatrixMultiplication and clinfo hang. In addition, with radeontop you can
> > see how the Graphics pipe goes from ~50% to 100%. The shader clock also
> > goes up from ~35% to ~96%.
> >
> > clinfo keeps printing (strace):
> > ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> >
> > And MatrixMultiplication prints the following (strace) if you try to
> > kill the process:
> >
> > sched_yield() = 0
> > futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> >
> >
> > After this, the GPU is not functional at all and you need a power cycle
> > to restore the system.
> >
> > Hardware info:
> > CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> >
> > 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
> >     DeviceName: Broadcom 5762
> >     Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> >     Kernel driver in use: amdgpu
> >     Kernel modules: amdgpu
> >
> > Linux kernel info:
> >
> > root@qt5222:~# uname -a
> > Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> >
> > By enabling the kernel lock stats I could see that MatrixMultiplication
> > hangs in the amdgpu_mn_invalidate_gfx function:
> >
> > [ 738.359202] 1 lock held by MatrixMultiplic/653:
> > [ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >
> > Looking at the amdgpu_mn_invalidate_gfx function, I can see that
> > dma_resv_wait_timeout_rcu is called with wait_all (fences) and
> > MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting
> > forever. According to the documentation: "When somebody tries to
> > invalidate the page tables we block the update until all operations on
> > the pages in question are completed, then those pages are marked as
> > accessed and also dirty if it wasn’t a read only access."
> > It looks like the fences are deadlocked and therefore it never returns.
> > Could that be possible? Any hint on where I can look to fix this?
> >
> > Thank you in advance.
> >
> > Here is the full dmesg output:
> >
> > [ 738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > [ 738.344937] Not tainted 5.11.0-rc6-qtec-standard #2
> > [ 738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 738.358240] task:MatrixMultiplic state:D stack: 0 pid: 653 ppid: 1 flags:0x00004000
> > [ 738.358254] Call Trace:
> > [ 738.358261] ? dma_fence_default_wait+0x1eb/0x230
> > [ 738.358276] __schedule+0x370/0x960
> > [ 738.358291] ? dma_fence_default_wait+0x117/0x230
> > [ 738.358297] ? dma_fence_default_wait+0x1eb/0x230
> > [ 738.358305] schedule+0x51/0xc0
> > [ 738.358312] schedule_timeout+0x275/0x380
> > [ 738.358324] ? dma_fence_default_wait+0x1eb/0x230
> > [ 738.358332] ? mark_held_locks+0x4f/0x70
> > [ 738.358341] ? dma_fence_default_wait+0x117/0x230
> > [ 738.358347] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > [ 738.358353] ? _raw_spin_unlock_irqrestore+0x39/0x40
> > [ 738.358362] ? dma_fence_default_wait+0x117/0x230
> > [ 738.358370] ? dma_fence_default_wait+0x1eb/0x230
> > [ 738.358375] dma_fence_default_wait+0x214/0x230
> > [ 738.358384] ? dma_fence_release+0x1a0/0x1a0
> > [ 738.358396] dma_fence_wait_timeout+0x105/0x200
> > [ 738.358405] dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > [ 738.358421] amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > [ 738.358688] __mmu_notifier_release+0x1bb/0x210
> > [ 738.358710] exit_mmap+0x2f/0x1e0
> > [ 738.358723] ? find_held_lock+0x34/0xa0
> > [ 738.358746] mmput+0x39/0xe0
> > [ 738.358756] do_exit+0x5c3/0xc00
> > [ 738.358763] ? find_held_lock+0x34/0xa0
> > [ 738.358780] do_group_exit+0x47/0xb0
> > [ 738.358791] get_signal+0x15b/0xc50
> > [ 738.358807] arch_do_signal_or_restart+0xaf/0x710
> > [ 738.358816] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > [ 738.358822] ? _raw_spin_unlock_irqrestore+0x39/0x40
> > [ 738.358831] ? ktime_get_mono_fast_ns+0x50/0xa0
> > [ 738.358844] ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > [ 738.359044] exit_to_user_mode_prepare+0xf2/0x1b0
> > [ 738.359054] syscall_exit_to_user_mode+0x19/0x60
> > [ 738.359062] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 738.359069] RIP: 0033:0x7f6b89a51887
> > [ 738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [ 738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > [ 738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > [ 738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > [ 738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > [ 738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > [ 738.359129]
> > Showing all locks held in the system:
> > [ 738.359141] 1 lock held by khungtaskd/54:
> > [ 738.359148] #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
> > [ 738.359187] 1 lock held by systemd-journal/174:
> > [ 738.359202] 1 lock held by MatrixMultiplic/653:
> > [ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >
> > Daniel
>
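
For triage, the "Showing all locks held in the system" block quoted above can be reduced to a task-to-lock table. What follows is only a minimal sketch and not part of the original report: the function name held_locks and both regular expressions are assumptions, written against the lockdep line format quoted in this thread, with each wrapped entry joined back onto one line and the quote prefixes stripped.

```python
import re

# Header line printed once per task, e.g.
#   "[ 738.359202] 1 lock held by MatrixMultiplic/653:"
HELD_RE = re.compile(r"(\d+) locks? held by (\S+)/(\d+):")

# One held-lock entry, e.g.
#   "#0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: func+0x34/0xa0 [amdgpu]"
LOCK_RE = re.compile(r"#(\d+): \S+ \((.+?)\)\{.*?, at: (\S+)")


def held_locks(dmesg_text):
    """Map (task, pid) to a list of (lock_name, acquired_at) tuples.

    Tasks whose lock entries are missing from the log keep an empty list.
    """
    result = {}
    current = None
    for line in dmesg_text.splitlines():
        header = HELD_RE.search(line)
        if header:
            current = (header.group(2), int(header.group(3)))
            result[current] = []
            continue
        entry = LOCK_RE.search(line)
        if entry and current is not None:
            result[current].append((entry.group(2), entry.group(3)))
    return result
```

Fed the dmesg excerpt from this thread, it maps ("MatrixMultiplic", 653) to a single entry: &adev->notifier_lock, taken at amdgpu_mn_invalidate_gfx+0x34/0xa0.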