From: Daniel Vetter
Date: Tue, 12 May 2020 15:45:08 +0200
Subject: Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
To: Alex Deucher
Cc: DRI Development, linux-rdma, Intel Graphics Development,
    Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
    "moderated list:DMA BUFFER SHARING FRAMEWORK", Daniel Vetter,
    Christian König, linux-media

On Tue, May 12, 2020 at 3:29 PM Alex Deucher wrote:
>
> On Tue, May 12, 2020 at 9:17 AM Daniel Vetter wrote:
> >
> > On Tue, May 12, 2020 at 3:12 PM Alex Deucher wrote:
> > >
> > > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter wrote:
> > > >
> > > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > I think it's time to stop this little exercise.
> > > > > >
> > > > > > The lockdep splat, for the record:
> > > > > >
> > > > > > [  132.583381] ======================================================
> > > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > > [  132.585461] ------------------------------------------------------
> > > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.587569]
> > > > > >                but task is already holding lock:
> > > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > [  132.589803]
> > > > > >                which lock already depends on the new lock.
> > > > > >
> > > > > > [  132.592009]
> > > > > >                the existing dependency chain (in reverse order) is:
> > > > > > [  132.593507]
> > > > > > -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > [  132.621903]
> > > > > > -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > [  132.643022]
> > > > > > -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > > [  132.655598]
> > > > > >                other info that might help us debug this:
> > > > > >
> > > > > > [  132.657739] Chain exists of:
> > > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > > >
> > > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > > >
> > > > > > [  132.661416]        CPU0                    CPU1
> > > > > > [  132.662126]        ----                    ----
> > > > > > [  132.662847]   lock(dma_fence_map);
> > > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > > [  132.665799]
> > > > > >                 *** DEADLOCK ***
> > > > > >
> > > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > > [  132.671902]
> > > > > >                stack backtrace:
> > > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > > [  132.676046] Call Trace:
> > > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > > >
> > > > > > Cc: linux-media@vger.kernel.org
> > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > Cc: linux-rdma@vger.kernel.org
> > > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > > Cc: Chris Wilson
> > > > > > Cc: Maarten Lankhorst
> > > > > > Cc: Christian König
> > > > > > Signed-off-by: Daniel Vetter
> > > > > > ---
> > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > > >                 /* displays are handled separately */
> > > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > > >                         /* XXX handle errors */
> > > > > > +
> > > > > > +                       /*
> > > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > > +                        *
> > > > > > +                        * Don't ask me how to fix this.
> > > > > > +                        */
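
To make the splat easier to follow: the two dependency chains above boil
down to taking the same two lock classes in opposite order. A condensed
sketch of that inversion -- the function names are from the backtraces,
but the two functions below are illustrative only, not the real driver
code paths, and they assume the dma_fence_begin_signalling() /
dma_fence_end_signalling() annotations introduced earlier in this series:

#include <linux/dma-fence.h>
#include <drm/drm_atomic_helper.h>
#include <drm/drm_device.h>
#include <drm/drm_modeset_lock.h>

/* Chain recorded at driver load (fbdev setup):
 * crtc_ww_class is held, then dma_fence_map is entered.
 */
static void fbdev_commit_path(struct drm_device *dev)
{
	bool cookie;

	drm_modeset_lock_all(dev);		/* crtc_ww_class */
	cookie = dma_fence_begin_signalling();	/* dma_fence_map nests inside */
	/* ... hw commit work that eventually signals fences ... */
	dma_fence_end_signalling(cookie);
	drm_modeset_unlock_all(dev);
}

/* Chain attempted at gpu recovery (this patch's path):
 * dma_fence_map is held, then crtc_ww_class is acquired.
 */
static void tdr_path(struct drm_device *dev)
{
	struct drm_atomic_state *state;
	bool cookie;

	cookie = dma_fence_begin_signalling();	/* dma_fence_map, from drm_sched_job_timedout */
	state = drm_atomic_helper_suspend(dev);	/* acquires crtc_ww_class: the inversion */
	dma_fence_end_signalling(cookie);
}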
> > > > >
> > > > > We actually have a fix for this. Will be out shortly.
> > > >
> > > > Spoilers? Solid way is to sidestep the entire thing by not resetting
> > > > the display block at all. Fixing the locking while still resetting the
> > > > display is going to be really hard otoh ...
> > >
> > > There's no way to avoid that. On dGPUs at least a full asic reset is
> > > a full asic reset. Mostly it just skips the modeset and does the minimum
> > > amount necessary to get the display block into a good state for reset.
> >
> > But how do you restore the display afterwards? "[RFC 13/17]
> > drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> > series has some ideas from me for at least some of the problems for
> > tdr when the display gets reset along with it. Whacking the display
> > while a modeset/flip/whatever is ongoing concurrently doesn't sound
> > like a good idea, so not sure how you can do that without taking the
> > drm_modeset_locks. And once you do that, it's deadlock time.
>
> We cache the current display hw state and restore it after the reset
> without going through the atomic interfaces so everything is back the
> way it was before the reset.

Hm this sounds interesting ... how do you make sure a concurrent
atomic update doesn't trample over the same mmio registers while you
do that dance?

> IIRC, when we reset the rest of the GPU, we disconnect the fences,
> and then re-attach them after a reset.

Where is that code?
Since I'm not sure how you can make that work without getting stuck
in another kind of deadlock in tdr. But maybe the code has some
clever trick to pull that off somehow.
-Daniel

>
> Alex
>
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
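
Postscript for readers joining the thread here: the dma_fence_map "lock"
in the splat is not a real lock but the lockdep annotation added earlier
in this series. A minimal usage sketch, assuming that API as posted
(dma_fence_begin_signalling() returns a cookie that is handed back to
dma_fence_end_signalling()); my_tdr_work() is a hypothetical timeout
handler standing in for drm_sched_job_timedout():

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

static void my_tdr_work(struct work_struct *work)
{
	/*
	 * Everything between begin/end is a fence-signalling critical
	 * section: lockdep records every lock taken inside it and will
	 * splat if any of those locks is held elsewhere while waiting
	 * on a dma_fence -- exactly the report above.
	 */
	bool cookie = dma_fence_begin_signalling();

	/* reset hw, cancel or resubmit jobs, signal their fences ... */

	dma_fence_end_signalling(cookie);
}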