From: Alex Deucher
Date: Tue, 12 May 2020 09:29:27 -0400
Subject: Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
To: Daniel Vetter
Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	"moderated list:DMA BUFFER SHARING FRAMEWORK", Daniel Vetter,
	Christian König, linux-media
References: <20200512085944.222637-1-daniel.vetter@ffwll.ch>
	<20200512085944.222637-17-daniel.vetter@ffwll.ch>
	<20200512125841.GH206103@phenom.ffwll.local>

On Tue, May 12, 2020 at 9:17 AM Daniel Vetter wrote:
>
> On Tue, May 12, 2020 at 3:12 PM Alex Deucher wrote:
> >
> > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter wrote:
> > >
> > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > I think it's time to stop this little exercise.
> > > > >
> > > > > The lockdep splat, for the record:
> > > > >
> > > > > [  132.583381] ======================================================
> > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > [  132.585461] ------------------------------------------------------
> > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.587569]
> > > > > but task is already holding lock:
> > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.589803]
> > > > > which lock already depends on the new lock.
> > > > >
> > > > > [  132.592009]
> > > > > the existing dependency chain (in reverse order) is:
> > > > > [  132.593507]
> > > > > -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.621903]
> > > > > -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.643022]
> > > > > -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > [  132.655598]
> > > > > other info that might help us debug this:
> > > > >
> > > > > [  132.657739] Chain exists of:
> > > > >   crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > >
> > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > >
> > > > > [  132.661416]        CPU0                    CPU1
> > > > > [  132.662126]        ----                    ----
> > > > > [  132.662847]   lock(dma_fence_map);
> > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > [  132.665799]
> > > > >  *** DEADLOCK ***
> > > > >
> > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > [  132.671902]
> > > > > stack backtrace:
> > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > [  132.676046] Call Trace:
> > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > >
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Cc: linux-rdma@vger.kernel.org
> > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > Cc: Chris Wilson
> > > > > Cc: Maarten Lankhorst
> > > > > Cc: Christian König
> > > > > Signed-off-by: Daniel Vetter
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > >                 /* displays are handled separately */
> > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > >                         /* XXX handle errors */
> > > > > +
> > > > > +                       /*
> > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > +                        *
> > > > > +                        * Don't ask me how to fix this.
> > > > > +                        */
> > > >
> > > > We actually have a fix for this.  Will be out shortly.
> > >
> > > Spoilers? Solid way is to sidestep the entire thing by avoiding resetting
> > > the display block entirely. Fixing the locking while still resetting the
> > > display is going to be really hard otoh ...
> >
> > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > a full asic reset.  Mostly just skips the modeset and does the minimum
> > amount necessary to get the display block into a good state for reset.
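For reference, the cycle lockdep complains about boils down to the
pattern below. This is a minimal sketch using the
dma_fence_begin_signalling()/dma_fence_end_signalling() annotations
introduced earlier in this series; the function body is illustrative
only, not the actual amdgpu code path.

#include <linux/dma-fence.h>
#include <drm/drm_atomic_helper.h>

/*
 * TDR path: everything inside the signalling section must complete
 * without taking locks that are also held while dma fences get
 * published or waited on, because the fences this section signals
 * are what unblocks those waiters.
 */
static void tdr_inversion_sketch(struct drm_device *dev)
{
	bool fence_cookie = dma_fence_begin_signalling();

	/* ... hw reset ... */

	/*
	 * Deadlock: drm_atomic_helper_suspend() starts a modeset lock
	 * acquire context (crtc_ww_class), but the atomic commit path
	 * (chain #2 in the splat) already ordered crtc_ww_class before
	 * dma_fence_map, so lockdep closes the cycle right here.
	 */
	drm_atomic_helper_suspend(dev);

	dma_fence_end_signalling(fence_cookie);
}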
>
> But how do you restore the display afterwards? "[RFC 13/17]
> drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> series has some ideas from me for at least some of the problems for
> tdr when the display gets reset along. Whacking the display while a
> modeset/flip/whatever is ongoing concurrently doesn't sound like a
> good idea, so not sure how you can do that without taking the
> drm_modeset_locks. And once you do that, it's deadlock time.

We cache the current display hw state and restore it after the reset
without going through the atomic interfaces, so everything is back the
way it was before the reset.  IIRC, when we reset the rest of the GPU,
we disconnect the fences and then re-attach them after the reset.

Alex

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
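The cache-and-restore flow described above, as a rough sketch. The
dm_cache_hw_state()/dm_restore_hw_state()/do_asic_reset_hw() helpers
are hypothetical stand-ins, not the real amdgpu DC entry points; the
point is only that nothing on this path takes drm_modeset_lock, so it
can run inside the dma_fence signalling critical section of gpu
recovery.

struct amdgpu_device;
struct dc_state;

/* hypothetical helpers standing in for the real amdgpu DC code */
struct dc_state *dm_cache_hw_state(struct amdgpu_device *adev);
void dm_restore_hw_state(struct amdgpu_device *adev, struct dc_state *state);
void do_asic_reset_hw(struct amdgpu_device *adev);

static void gpu_recover_sketch(struct amdgpu_device *adev)
{
	/* snapshot the display hw state up front, no modeset locks needed */
	struct dc_state *cached = dm_cache_hw_state(adev);

	/* full asic reset, display block included */
	do_asic_reset_hw(adev);

	/*
	 * Reprogram the saved state directly, bypassing the atomic
	 * interfaces and their crtc_ww_class locks, so the display comes
	 * back exactly as it was before the reset.
	 */
	dm_restore_hw_state(adev, cached);
}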